Methods and Systems for DNA Data Storage

ABSTRACT

In various embodiments, an information storage system comprises: a writing device for synthesizing a nucleotide sequence that encodes a set of information; and a reading device for interpreting the nucleotide sequence by decoding the interpreted nucleotide sequence into the set of information, wherein the reading device comprises a molecular electronics sensor, the sensor comprising a pair of spaced apart electrodes and a molecular complex attached to each electrode to form a molecular electronics circuit, wherein the molecular complex comprises a bridge molecule and a probe molecule, and wherein the molecular electronics sensor produces distinguishable signals in a measurable electrical parameter of the molecular electronics sensor, when interpreting the nucleotide sequence.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/477,106, filed Jul. 10, 2019 (now U.S. Pat. No. 10,902,939), and entitled “Methods and Systems for DNA Data Storage,” which is a U.S. national phase filing under 35 U.S.C. § 371 of PCT/US2018/013140 filed on Jan. 10, 2018, which claims priority to, and the benefit of, U.S. Provisional Patent Application Ser. No. 62/444,656, filed Jan. 10, 2017 and entitled “Methods, Apparatus and Systems for DNA Data Storage,” and U.S. Provisional Patent Application Ser. No. 62/547,692, filed Aug. 18, 2017 and entitled “Molecular Electronics Sensors for DNA Data Storage,” the disclosures of which are incorporated herein by reference in their entireties.

FIELD

The present disclosure generally relates to electronic data storage and retrieval, and more particularly to a DNA information storage and retrieval system comprising molecular sensors for reading DNA sequences and encoder/decoder algorithms for DNA sequence-binary conversions.

BACKGROUND

The advent of digital computing in the 20th century created the need for archival storage of large amounts of digital or binary data. Archival storage is intended to house data for long periods of time, e.g., years, decades or longer, in a way that is very low cost, and that supports the rare need to re-access the data. Although an archival storage system may feature the ability to hold unlimited amounts of data at very low cost, such as through a physical storage medium able to remain dormant for long periods of time, the data writing and recovery in such a system can be relatively slow or otherwise costly processes. The dominant forms of archival digital data storage that have been developed to date include magnetic tape, and, more recently, compact optical disc (CD). However, as data production grows, there is a need for even higher density, lower cost, and longer lasting archival digital data storage systems.

It has been observed that in biology, the genomic DNA of living organisms functions as a form of digital information archival storage. On the timescale of the existence of a species, which may extend for thousands to millions of years, the genomic DNA in effect stores the genetic biological information that defines the species. The complex enzymatic, biochemical processes embodied in the biology, reproduction and survival of the species provide the means of writing, reading and maintaining this information archive. This observation has motivated the idea that perhaps the fundamental information storage capacity of DNA could be harnessed as the basis for high density, long duration archival storage of more general forms of digital information.

What makes DNA attractive for information storage is the extremely high information density resulting from molecular scale storage of information. In theory, for example, all human-produced digital information recorded to date, estimated to be approximately 1 ZB (zettabyte) (~10²¹ bytes), could be recorded in less than 10²² DNA bases, or 1/60th of a mole of DNA bases, which would have a mass of just 10 grams. In addition to high data density, DNA is also a very stable molecule, which can readily last for thousands of years without substantial damage, and which could potentially last far longer, for tens of thousands of years, or even millions of years, such as observed naturally with DNA frozen in permafrost or encased in amber.
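The density estimate above can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only and rests on two assumptions that are not stated in this disclosure: an idealized 2 bits per base (a four-letter alphabet with no redundancy or error-correction overhead), and an approximate average mass of 330 g/mol per nucleotide. The function name `dna_requirements` is hypothetical.

```python
AVOGADRO = 6.022e23          # bases per mole
BITS_PER_BASE = 2            # assumption: 4-letter alphabet, no coding overhead
GRAMS_PER_MOLE_BASE = 330.0  # assumption: approximate average nucleotide mass


def dna_requirements(n_bytes):
    """Return (bases, moles, grams) needed to hold n_bytes under the assumptions above."""
    bases = n_bytes * 8 / BITS_PER_BASE
    moles = bases / AVOGADRO
    grams = moles * GRAMS_PER_MOLE_BASE
    return bases, moles, grams


# 1 ZB of data, taken as roughly 1e21 bytes.
bases, moles, grams = dna_requirements(1e21)
print(f"1 ZB needs about {bases:.1e} bases ({moles:.4f} mol, {grams:.1f} g)")

# The 1e22-base upper bound quoted in the text, expressed in moles and grams.
bound_moles = 1e22 / AVOGADRO
print(f"1e22 bases is about 1/{1 / bound_moles:.0f} mol, "
      f"{bound_moles * GRAMS_PER_MOLE_BASE:.1f} g")
```

Under these assumptions the required number of bases falls below the 10²² bound, and the corresponding mass is a few grams, the same order of magnitude as the 10 gram figure given above. Any practical encoding with redundancy would increase these numbers.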

SUMMARY

In various embodiments, an information storage system is disclosed. In various aspects, the system comprises a DNA reading device, a digital data encoding/decoding algorithm, and a DNA writing device, wherein the properties of these three elements are co-optimized to minimize or reduce various cost metrics and increase overall system performance. In various aspects, the co-optimization may comprise reducing the error rate of the system, through balancing, avoiding, or correcting the errors in DNA reading and writing. In other instances, the co-optimization may comprise reducing the DNA reading or writing time in the system, e.g., by avoiding the use of slower speed DNA sequence motifs, and/or by using error correction/avoidance to compensate for errors incurred from rapid operation of the system.

In various embodiments of the present disclosure, a DNA data reader is provided for use in a DNA data storage system. In particular, a molecular sensor is provided that can extract the digital information suitably encoded within a single DNA molecule. In certain aspects, such sensors may be in a high-density chip-based format that can provide the high-throughput, low-cost, fast data extraction capability required for large scale DNA data storage systems. In various examples, the sensor for reading the digital data stored in DNA molecules processes individual encoded DNA molecules directly, so that there is no need for complicated sample preparation such as making copies of DNA or clonal populations. In various aspects of the system, data is stored directly in synthetic DNA or DNA analogues that are synthesized with features beneficial for digital data storage that cannot be replicated by standard methods of copying DNA.

In various embodiments of a DNA data storage system, recovered data may be stored in a great variety of DNA analogs or modified DNA molecules in addition to native DNA, which provides greater choices of data writing systems and more effective data storage systems. In various aspects of the system, the time required to extract information encoded in a DNA molecule is short, e.g., on the order of seconds, which fundamentally enables short turnaround times for data recovery. In various aspects, the system can perform well over a large range of DNA molecular lengths, e.g., from lengths as short as tens of bases, to hundreds of bases, to thousands of bases, and greater than tens of thousands of bases. This ability provides greater flexibility in the choice of DNA writing/synthesis technology, and eliminates the need to further prepare DNA samples prior to reading to meet length constraints in reading the digital information.

In various aspects of the present disclosure, a molecular sensor for DNA sequence reading can be deployed in a highly scalable, low cost, CMOS chip format, providing for efficient mass manufacturing, and low cost systems and instruments, and overall low costs in reading digital data stored in DNA. In various aspects, systems and devices required to read Exabyte-scale digital data from DNA data are highly compact and energy efficient in order to support practical, robust deployment locally at on-site data centers and to support highly scalable cloud-based archival data storage services.

In various aspects, reading of data stored in DNA in accordance with the present disclosure exceeds the performance, in speed, throughput and cost, of reading data archived in conventional archival storage formats such as magnetic tape or optical discs. An advantage of the present DNA data storage system is that it provides enabling technology for DNA digital data storage systems capable of practical Exabyte scale storage, and Zettabyte scale storage.

In various embodiments, the DNA writing device of a DNA Archival Storage System comprises a CMOS chip further comprising molecular electronics sensor devices. In other instances, the DNA writing device is a CMOS chip comprising voltage/current directed synthesis sites on pixel electrodes.

In various embodiments, aspects of the archive operations, such as copy, append, targeted deletion, targeted reading, and searching through molecular biology procedures, as applied to a DNA storage archive system are disclosed herein.

In various embodiments, an information storage system comprises: a writing device for synthesizing a nucleotide sequence that encodes a set of information; and a reading device for interpreting the nucleotide sequence by decoding the interpreted nucleotide sequence into the set of information, wherein the reading device comprises a molecular electronics sensor, the sensor comprising a pair of spaced apart electrodes and a molecular complex attached to each electrode to form a molecular electronics circuit, wherein the molecular complex comprises a bridge molecule and a probe molecule, and wherein the molecular electronics sensor produces distinguishable signals in a measurable electrical parameter of the molecular electronics sensor, when interpreting the nucleotide sequence.

In various aspects, the set of information comprises binary data. In certain aspects, the nucleotide sequence comprises a DNA sequence. For example, the system provides binary data storage in the form of DNA molecules, and provides for extraction of the archived data when retrieval is desired.

In various embodiments, the system further comprises at least one of error detecting schemes or error correction schemes within the DNA sequence. In certain aspects, the error detecting schemes are selected from repetition code, parity bits, checksums, cyclic redundancy checks, cryptographic hash functions and Hamming codes, and the error correction schemes are selected from automatic repeat request, convolutional codes, block codes, hybrid automatic repeat request and Reed-Solomon codes.
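As a simple illustration of one of the error detecting schemes listed above, the following sketch appends a cyclic redundancy check to a binary payload before it would be encoded into DNA, and verifies it after read-back. This is a minimal example under stated assumptions, not the scheme of any particular embodiment: the 4-byte big-endian framing and the helper names `add_crc32` and `check_crc32` are hypothetical, and the CRC-32 routine is the one provided by the Python standard library `zlib` module.

```python
import zlib


def add_crc32(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (hypothetical framing)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def check_crc32(framed: bytes) -> bool:
    """Return True if the trailing 4 bytes match the CRC-32 of the rest."""
    payload, stored = framed[:-4], framed[-4:]
    computed = zlib.crc32(payload) & 0xFFFFFFFF
    return computed.to_bytes(4, "big") == stored


framed = add_crc32(b"archive record")
print(check_crc32(framed))  # an intact frame passes the check

# Flip one bit to mimic a read or write error; the check should now fail.
corrupted = bytes([framed[0] ^ 0x01]) + framed[1:]
print(check_crc32(corrupted))
```

A check of this kind only detects errors. Correcting them would require one of the error correction schemes named above, such as a Reed-Solomon code, or a re-read of the molecule.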

In various embodiments, the writing device of the system comprises a CMOS chip based array of actuator pixels for DNA synthesis, the actuator pixels directing voltage/current or light-mediated deprotection within a DNA synthesis reaction comprising phosphoramidite or ligation chemistries.

In various embodiments, the probe molecule comprises a polymerase enzyme, and the measurable electrical parameter of the sensor is modulated by enzymatic activity of the polymerase enzyme. The polymerase enzyme may comprise a native polymerase enzyme or a genetically engineered polymerase enzyme selected from Klenow, Phi29, Taq, Bst, T7, or a reverse transcriptase.

In various embodiments, the reading device of the system further comprises a buffer solution, operating parameters for measuring the measurable electrical parameter, and two or more sequence segments of a DNA template molecule, that, when processed by the polymerase, produce the distinguishable signals in the measurable electrical parameter when performed in the conditions provided by the buffer solution and the operating parameters. In certain aspects, the buffer solution comprises modified dNTPs. In various aspects, the sequence segments of the DNA template molecule that produce the distinguishable signals comprise any one or combination of different DNA bases, modified DNA bases, DNA base analogues, multi-base sequences or motifs, or homopolymer runs of DNA bases.

In various embodiments, the measurable electrical parameter of the sensor comprises a source-drain current between the spaced apart electrodes and through the molecular complex. The molecular electronics sensor may be part of a CMOS sensor array chip further comprising a plurality of molecular electronics sensors and supporting pixel circuitry that performs measurement of the measurable electrical parameter.

In various embodiments, the molecular electronics sensor further comprises a gate electrode adjacent the spaced apart electrodes. In various aspects, the bridge molecule of a sensor in the system comprises a double stranded DNA oligomer, a protein alpha helix, a graphene nanoribbon, a carbon nanotube, an antibody, or a Fab arm of an antibody.

In various embodiments, a method of interpreting a set of information encoded in a nucleotide sequence is disclosed. The method comprises: supplying the nucleotide sequence to a molecular electronics sensor capable of producing distinguishable signals in a measurable electrical parameter of the molecular electronics sensor, relating to the set of information; generating the distinguishable signals; and converting the distinguishable signals into the set of information, wherein the molecular electronics sensor comprises a pair of spaced apart electrodes and a molecular complex attached to each electrode to form a molecular electronics circuit, wherein the molecular complex comprises a bridge molecule and a probe molecule. In various aspects, the set of information comprises binary data. In certain aspects, the nucleotide sequence comprises a DNA sequence.

In various embodiments, a method of encoding a set of information into a nucleotide sequence is disclosed. The method comprises: providing a set of information; converting the set of information into one or more predetermined nucleotides capable of generating distinguishable signals in a measurable electrical parameter of a molecular electronics sensor, using an encoding scheme; and assembling the one or more nucleotides into the nucleotide sequence. In various aspects, the one or more predetermined nucleotides capable of generating distinguishable signals comprise nucleotides that are resistant to secondary structure formation compared to a variant of the same nucleotides.

In various embodiments, converting the set of information into a nucleotide sequence comprises use of a binary encoding scheme (denoted herein as “BES”). In various examples, the BES comprises any one or more of BES1, BES2, BES3, BES4, BES5 and BES6.

In various embodiments, the molecular electronics sensor in the method comprises a pair of spaced apart electrodes and a molecular complex attached to each electrode to form a molecular electronics circuit, wherein the molecular complex comprises a bridge molecule and a probe molecule, and wherein the molecular electronics sensor produces the distinguishable signals in a measurable electrical parameter of the molecular electronics sensor when interpreting the nucleotide sequence. In various aspects, the set of information in the method comprises binary data.

BRIEF DESCRIPTION OF THE DRAWING FIGURES

FIG. 1 illustrates elements of an embodiment of a DNA information storage system;

FIG. 2 illustrates a generalized view of primary DNA storage system information phases and processes;

FIG. 3 illustrates the need for error compensation by diagrammatically showing the perfect process versus a process wherein errors are introduced in reading and writing DNA;

FIG. 4 illustrates an embodiment of ideal error compensation;

FIG. 5 illustrates cost-conscious management of errors in a DNA information storage system;

FIG. 6 illustrates an ideal error aware system;

FIG. 7 illustrates an example of 2-way error compensation;

FIG. 8 illustrates an example of 2-way cost compensation for general cost reduction/optimization;

FIG. 9 illustrates exemplary factors of a cost-optimized encoding system;

FIG. 10A illustrates the basic concept of a molecular electronic sensing circuit sensing the interaction of interacting molecules with a sensor molecule complex;

FIG. 10B illustrates an embodiment of a polymerase-based molecular sensor used as a reader for data encoded into synthetic DNA molecules;

FIG. 11 illustrates an embodiment of the polymerase-based sensor of FIG. 10B, where the polymerase is conjugated to a bridging molecule spanning the electrodes;

FIG. 12 illustrates an embodiment of a sensor circuit wherein the polymerase is conjugated directly in the current path, and wherein two arm molecules provide connection to the electrodes;

FIG. 13 illustrates an embodiment of a sensor where the polymerase is directly conjugated to the electrodes, with no arm or bridge molecules;

FIG. 14 shows the 3D detailed protein structure of one specific polymerase molecule, the Klenow Fragment of E. coli Polymerase I;

FIG. 15 illustrates an embodiment of a molecular electronic sensor wherein the Klenow Fragment of E. coli Polymerase I is conjugated to a bridge molecule that spans the gap between the electrodes;

FIG. 16 illustrates an embodiment of a molecular electronic sensor where the Klenow Fragment of E. coli Polymerase I is conjugated directly into the current path through use of two arm molecules linking the polymerase to the electrodes;

FIG. 17 illustrates an embodiment of a molecular electronic sensor where the Klenow Fragment of E. coli Polymerase I is conjugated directly into the current path and directly to the metal electrodes, with no arm or bridge molecules;

FIG. 18 illustrates an embodiment of a molecular sensor usable as a DNA reader device in various aspects of the present DNA data storage system;

FIG. 19 illustrates a schematic of a test set-up for electrical measurements on molecular sensors for DNA sequence comprising a molecular sensor post-processed onto the pixels of a CMOS pixel array;

FIGS. 20A, 20B and 20C set forth three electron microscope (EM) images of electrodes (at increasing resolution) with gold metal dot contacts for bridge binding in sensors;

FIG. 21 shows current (pA) versus time (sec) plots obtained by measuring DNA incorporation signals with the sensor of FIG. 18;

FIG. 22 illustrates the use of modified dNTPs to produce enhanced signals from each base as the polymerase processes a template DNA, incorporating the modified dNTPs, in this case producing four distinguishable signal features;

FIG. 23 illustrates the use of two modified dNTPs, here modified dATP (A*) and dCTP (C*), to produce two distinguishable signals of incorporation, which provides a means to encode binary bits 0/1 into the template DNA;

FIG. 24 illustrates the use of two different sequence motifs, here homopolymers AA and CCC, to produce two distinguishable signals, which provides a means to encode binary bits 0/1 into the template DNA. In this case, AA and CCC provide two distinguishable sequence motifs that could be used for information encoding and recovery. Thus useful information encoding and reading is possible even without single base resolution of DNA sequence, by instead relying on distinguishable sequence motifs;

FIG. 25 shows actual experimental data, produced by an embodiment of the sensor of FIG. 18 in which specific sequence motifs of poly A and poly C produced distinguishable signals that are usable to encode 0/1 binary data;

FIG. 26A illustrates the use of two different sequence motifs, GATT and ACA, to produce two distinguishable signals that provide a method to encode binary bits 0/1 into the template DNA;

FIG. 26B illustrates the use of three different sequence motifs, here GATT, ACA and AGG, to produce three distinguishable signals, which provides a means to encode digital data with a three-state encoding;

FIG. 26C illustrates an embodiment where the encoding DNA is synthesized with base analogues, the modified bases X and Y, to produce 2 distinguishable signals when processed by polymerase with standard dNTPs, which provides a means to encode digital data with 2 state encoding, for example as shown by scheme “BES4” in FIG. 29;

FIG. 26D illustrates an embodiment where the encoding DNA is synthesized with 8 different bases and base analogues, the standard A, C, G, T, and modified bases X, Y, Z, W, to produce 8 distinguishable signals when processed by polymerase with standard dNTPs, which provides a means to encode digital data with an 8 state encoding, for example as shown by scheme “BES5” in FIG. 29;

FIG. 27A sets forth different primer constructs that can be used in information-encoding DNA, such that templates can be presented in properly primed form for engagement with the polymerase of the sensor;

FIG. 27B sets forth embodiments of DNA template strand architectures that enable a polymerase-based sensor to interrogate the same DNA data payload multiple times;

FIG. 28 schematically illustrates how physical DNA structure relates to the logical structure of a DNA data storage molecule;

FIG. 29 sets forth various embodiments of binary data encoding schemes usable to encode a binary data payload into a DNA sequence, including examples of Binary Encoding Schemes (“BES”) for representing a data payload as DNA sequence, along with primary translations between digital data and encoding DNA sequences;

FIG. 30A illustrates an embodiment of a fabrication stack usable to dispose individual DNA reader sensors on a chip in the form of massively parallel arrays;

FIG. 30B illustrates an embodiment of a high-level CMOS chip pixel array and details of a molecular electronics sensor circuit pixel in the array;

FIG. 31 shows a circuit schematic and simulated measurement for the pixel circuit of FIG. 30B;

FIG. 32 illustrates an embodiment of an annotated chip design layout file and an optical microscope image of the corresponding finished chip for comparison;

FIG. 33 shows SEM images of the fabricated chip of FIG. 32, including insets of expanded resolution SEM images showing a pixel and nanoelectrodes with a polymerase molecular complex in place;

FIG. 34 shows a schematic of a complete system for reading DNA data with chip-based DNA reader sensors;

FIG. 35 shows a schematic of an embodiment of a cloud-based DNA data archival storage system, in which a multiplicity of the DNA reading system of FIG. 34 are aggregated to provide the data reader server;

FIG. 36 illustrates an alternate embodiment of a DNA data reader sensor in which a polymerase is complexed with a nanopore ion current sensor, producing distinguishable signal features in the nanopore ion current while processing DNA;

FIG. 37 shows an embodiment of a DNA data reader sensor wherein a polymerase is complexed with a nanopore ion current sensor, in which the polymerase is directly conjugated to the nanopore, and in which the dNTPs are modified by groups that interact with and alter the pore ion current during incorporation;

FIG. 38 illustrates an embodiment of a DNA data reader sensor in which a polymerase is complexed with a carbon nanotube molecular wire spanning positive and negative electrodes, which produces distinguishable signal features in the measured current passing through the carbon nanotube; and

FIG. 39 illustrates an embodiment of a Zero Mode Waveguide sensor complexed with a single polymerase, shown in cross section, that produces distinguishable optical signals corresponding to DNA features.

DETAILED DESCRIPTION

In various embodiments, a DNA data storage system utilizing DNA molecules as a general purpose means of digital information storage is disclosed. In certain aspects, a system for digital information storage comprises a DNA reading device, an information encoder/decoder algorithm, and a DNA writing device. In other aspects, the interrelation of these three elements and their co-optimization are disclosed.

In various embodiments, a data reader for a DNA data storage system is disclosed. In various aspects, a DNA reading device comprises a sensor that extracts information from a single DNA molecule. The sensor may be deployed in a chip-based format. In various examples, data reading systems that support such a chip-based sensor device are disclosed.

Definitions

As used herein, the term “DNA” refers to both biological DNA molecules and synthetic versions, such as made by nucleotide phosphoramidite chemistry, ligation chemistry or other synthetic organic methodologies. DNA, as used herein, also refers to molecules comprising chemical modifications to the bases, sugar, and/or backbone, such as known to those skilled in nucleic acid biochemistry. These include, but are not limited to, methylated bases, adenylated bases, other epigenetically marked bases, and non-standard or universal bases such as inosine or 3-nitropyrrole, or other nucleotide analogues, or ribobases, or abasic sites, or damaged sites. DNA also refers expansively to DNA analogues such as peptide nucleic acids (PNA), locked nucleic acids (LNA), and the like, including the biochemically similar RNA molecule and its synthetic and modified forms. All these biochemically closely related forms are implied by the use of the term DNA, in the context of the data storage molecule used in a DNA data storage system herein. Further, the term DNA herein includes single stranded forms, double helix or double-stranded forms, hybrid duplex forms, forms containing mismatched or non-standard base pairings, non-standard helical forms such as triplex forms, and molecules that are partially double stranded, such as a single-stranded DNA bound to an oligo primer, or a molecule with a hairpin secondary structure. Generally as used herein, the term DNA refers to a molecule comprising a single-stranded component that can act as the template for a polymerase enzyme to synthesize a complementary strand therefrom.

DNA sequences as written herein, such as GATTACA, refer to DNA in the 5′ to 3′ orientation, unless specified otherwise. For example, GATTACA as written herein represents the single stranded DNA molecule 5′-G-A-T-T-A-C-A-3′. In general, the convention used herein follows the standard convention for written DNA sequences used in the field of molecular biology.

As used herein, the term “polymerase” refers to an enzyme that catalyzes the formation of a nucleotide chain by incorporating DNA or DNA analogues, or RNA or RNA analogues, against a template DNA or RNA strand. The term polymerase includes, but is not limited to, wild-type and mutant forms of DNA polymerases, such as Klenow, E. coli Pol I, Bst, Taq, Phi29, and T7, wild-type and mutant forms of RNA polymerases, such as T7 and RNA Pol I, and wild-type and mutant reverse transcriptases that operate on an RNA template to produce DNA, such as AMV and MMLV.

As used herein, the term “dNTP” refers to both the standard, naturally occurring nucleoside triphosphates used in biosynthesis of DNA (i.e., dATP, dCTP, dGTP, and dTTP), and natural or synthetic analogues or modified forms of these, including those that carry base modifications, sugar modifications, or phosphate group modifications, such as an alpha-thiol modification or gamma phosphate modifications, or the tetra-, penta-, hexa- or longer phosphate chain forms, or any of the aforementioned with additional groups conjugated to any of the phosphates, such as the beta, gamma or higher order phosphates in the chain. In general, as used herein, “dNTP” refers to any nucleoside triphosphate analogue or modified form that can be incorporated by a polymerase enzyme as it extends a primer, or that would enter the active pocket of such an enzyme and engage transiently as a trial candidate for incorporation.

As used herein, “buffer,” “buffer solution” and “reagent solution” refer to a solution that provides an environment in which the polymerase sensor can operate and produce signals from supplied templates. In various embodiments, the solution is aqueous. The buffer, buffer solution or reagent solution may comprise components such as salts, pH buffers, divalent cations, detergents, blocking agents, solvents, template primer oligos, proteins that complex with a polymerase, and polymerase substrates (e.g., dNTPs, analogues or modified forms of dNTPs, and DNA substrates or templates).

As used herein, “binary data” or “digital data” refers to data encoded using the standard binary code, or a base 2 {0,1} alphabet, data encoded using a hexadecimal base 16 alphabet, data encoded using the base 10 {0-9} alphabet, data encoded using ASCII characters, or data encoded using any other discrete alphabet of symbols or characters in a linear encoding fashion.

As used herein, “digital data encoded format” refers to a series of binary digits, or other symbolic digits or characters that come from the primary translation of DNA sequence features used to encode information in DNA, or the equivalent logical string of such classified DNA features. In some embodiments, information to be archived as DNA may be translated into binary, or may exist initially as binary data, and then this data may be further encoded with error correction and assembly information, into the format that is directly translated into the code provided by the distinguishable DNA sequence features. This latter association is the primary encoding format of the information. Application of the assembly and error correction procedures is a further, secondary level of decoding, back towards recovering the source information.

As used herein, “distinguishable DNA sequence features” means those features of a data-encoding DNA molecule that, when processed by a sensor polymerase, produce distinct signals that can be used to encode information. Such features may be, for example, different bases, different modified bases or base analogues, different sequences or sequence motifs, or combinations of such to achieve features that produce distinguishable signals when processed by a sensor polymerase.
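To make the idea of encoding with distinguishable sequence features concrete, the sketch below maps binary digits onto the two motifs GATT and ACA named in the description of FIG. 26A and parses them back. It is an illustration only: the assignment of GATT to 0 and ACA to 1 is an assumption, the function names are hypothetical, and the decoder operates on the letter sequence itself, whereas an actual reader would classify the electrical signals that each motif produces.

```python
# Assumed assignment of motifs to bits; the figure description names the
# motifs but the particular bit mapping here is chosen for illustration.
MOTIF_FOR_BIT = {"0": "GATT", "1": "ACA"}


def encode_bits(bits: str) -> str:
    """Concatenate one motif per bit to form the data-encoding sequence."""
    return "".join(MOTIF_FOR_BIT[b] for b in bits)


def decode_sequence(seq: str) -> str:
    """Parse the sequence left to right, one motif at a time.

    Parsing is unambiguous here because the two motifs start with
    different letters (G versus A).
    """
    bits = []
    i = 0
    while i < len(seq):
        for bit, motif in MOTIF_FOR_BIT.items():
            if seq.startswith(motif, i):
                bits.append(bit)
                i += len(motif)
                break
        else:
            raise ValueError(f"no known motif at position {i}")
    return "".join(bits)


sequence = encode_bits("0110")
print(sequence)                   # GATTACAACAGATT
print(decode_sequence(sequence))  # 0110
```

The same structure extends to more than two motifs, such as the three-state example of FIG. 26B, by adding entries to the mapping, provided the motif set remains uniquely parsable.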

As used herein, a “DNA sequence motif” refers to either a specific letter sequence or a pattern representing any member of a specific set of such letter sequences. For example, the following are sequence motifs that are specific letter sequences: GATTACA, TAC, or C. In contrast, the following are sequence motifs that are patterns: G[A/T]A is a pattern representing the explicit set of sequences {GAA, GTA}, and G[2-5] is a pattern referring to the set of sequences {GG, GGG, GGGG, GGGGG}. The explicit set of sequences is the unambiguous description of the motif, while pattern shorthand notations such as these are common compact ways of describing such sets. Motif sequences such as these may be describing native DNA bases, or may be describing modified bases, in various contexts. In various contexts, the motif sequences may be describing the sequence of a template DNA molecule, and/or may be describing the sequence on the molecule that complements the template.

As used herein, “sequence motifs with distinguishable signals,” in the cases of patterns, means that there is a first motif pattern representing a first set of explicit sequences, and any of said sequences produces the first signal, and there is a second motif pattern representing a second set of explicit sequences, and any of said sequences produces the second signal, and the first signal is distinguishable from the second signal. For example, if motif G[A/T]A and motif G[3-5] produce distinguishable signals, it means that any of the set {GAA, GTA} produce a first signal, and any of the set {GGG, GGGG, GGGGG} produce a second signal, distinguishable from the first.
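The pattern shorthand used in the two preceding definitions can be expanded mechanically into the explicit sets it stands for. The following sketch handles only the two notations that appear in the examples, an alternation such as [A/T] and a repeat range such as G[2-5], over the letters A, C, G and T. The function name `expand_motif` and the regular expression are illustrative assumptions, not a notation defined by this disclosure.

```python
import itertools
import re

# One token is either: a base with a repeat range (G[2-5]), an alternation
# of single bases ([A/T]), or a single literal base.
TOKEN = re.compile(r"([ACGT])\[(\d+)-(\d+)\]|\[([ACGT](?:/[ACGT])+)\]|([ACGT])")


def expand_motif(pattern: str) -> set:
    """Expand a motif pattern into the explicit set of sequences it represents."""
    choices = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"cannot parse pattern at position {pos}")
        base, lo, hi, alts, single = m.groups()
        if base is not None:
            # Repeat range: the base repeated lo..hi times.
            choices.append([base * n for n in range(int(lo), int(hi) + 1)])
        elif alts is not None:
            # Alternation: any one of the listed bases.
            choices.append(alts.split("/"))
        else:
            choices.append([single])
        pos = m.end()
    return {"".join(parts) for parts in itertools.product(*choices)}


first = expand_motif("G[A/T]A")  # the set {GAA, GTA}
second = expand_motif("G[3-5]")  # the set {GGG, GGGG, GGGGG}
print(sorted(first), sorted(second))

# Two motif patterns can only stand for distinct signals if no explicit
# sequence belongs to both sets.
print(first.isdisjoint(second))
```

Checking that the expanded sets do not overlap is a necessary condition only; whether the two sets actually produce distinguishable signals is determined by measurement on the sensor.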

As used herein, “distinguishable signals” refers to one electrical signal from a sensor being discernibly different than another electrical signal from the sensor, either quantitatively (e.g., peak amplitude, signal duration, and the like) or qualitatively (e.g., peak shape, and the like), such that the difference can be leveraged for a particular use. In a non-limiting example, two current peaks versus time from an operating molecular sensor are distinguishable if there is more than about a 1×10⁻¹⁰ Amp difference in their amplitudes. This difference is sufficient to use the two peaks as two distinct binary bit readouts, e.g., a 0 and a 1. In some instances, a first peak may have a positive amplitude, e.g., from about 1×10⁻¹⁰ Amp to about 20×10⁻¹⁰ Amp amplitude, whereas a second peak may have a negative amplitude, e.g., from about 0 Amp to about −5×10⁻¹⁰ Amp amplitude, making these peaks discernibly different and usable to encode different binary bits, i.e., 0 or 1.
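The amplitude criterion in the non-limiting example above can be written as a simple test. The sketch below is illustrative and makes several assumptions beyond the text: peak amplitudes are taken as already extracted from the current trace, positive peaks are assigned to bit 1 and non-positive peaks to bit 0, and the decision threshold of 0.5×10⁻¹⁰ Amp is an arbitrary value placed between the two example amplitude ranges. All names are hypothetical.

```python
MIN_SEPARATION_AMP = 1e-10    # from the example: more than about 1e-10 A apart
DECISION_THRESHOLD = 0.5e-10  # assumption: placed between the two example ranges


def distinguishable(peak_a: float, peak_b: float,
                    min_sep: float = MIN_SEPARATION_AMP) -> bool:
    """Return True if two peak amplitudes (in amperes) differ by more than min_sep."""
    return abs(peak_a - peak_b) > min_sep


def peaks_to_bits(peaks, threshold: float = DECISION_THRESHOLD):
    """Map each peak amplitude to a bit: above threshold -> 1, otherwise -> 0.

    The polarity-to-bit assignment is an assumption made for illustration.
    """
    return [1 if p > threshold else 0 for p in peaks]


# A positive peak of 5e-10 A and a negative peak of -2e-10 A differ by 7e-10 A.
print(distinguishable(5e-10, -2e-10))
print(peaks_to_bits([5e-10, -2e-10, 12e-10]))
```

A practical reader would likely combine amplitude with other quantitative or qualitative features mentioned above, such as signal duration or peak shape, rather than rely on a single threshold.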

As used herein, a “data-encoding DNA molecule,” or “DNA data encoding molecule,” refers to a molecule synthesized to encode data in DNA, or copies or other DNA derived from such molecules.

As used herein, “reading data from DNA” refers to any method of measuring the distinguishable signals that correspond to the DNA molecular features that were used to encode information into the DNA molecule.

As used herein, electrodes refer to nano-scale conducting metal elements, with a nanoscale-sized gap between two electrodes in an individual pair of electrodes, and, in some embodiments, comprising a gate electrode capacitively coupled to the gap region, which may be a buried or “back” gate, or a side gate. The electrodes may be referred to as “source” and “drain” electrodes in some contexts, or as “positive” and “negative” electrodes, such terminologies being common in electronics. Nano-scale electrodes will have a gap width between each electrode in a pair of electrodes in the 1 nm-100 nm range, and will have other critical dimensions, such as width and height and length, also in this range. Such nano-electrodes may comprise a variety of materials that provide conductivity and mechanical stability, such as metals, or semiconductors, for example, or a combination of such materials. Examples of metals for electrodes include titanium and chromium.

General aspects of a DNA data storage system in accordance with the present disclosure, usable for archiving and later accessing stored data, are disclosed in reference to various drawing figures.

FIG. 1 illustrates an embodiment of a DNA information storage system in accordance with the present disclosure. This example shows the major elements of a DNA storage system, including the physical system used to handle and maintain the DNA material during storage, and which carries out operations on the stored archive, such as copying. An external computer provides a high level control of the system, supplying information for storage, and receiving extracted information. Information is encoded as DNA sequences, synthesized, stored, and then read, decoded and output. In addition, such a system is capable of physical I/O of the DNA archive material samples as well.

FIG. 2 illustrates primary DNA storage system information phases and processes, including the major phases of information existing in the overall system (depicted in FIG. 2 as boxes), along with the primary operations transitioning from one form to another (depicted in FIG. 2 by arrows).

In various aspects of the present disclosure, a DNA information storage system comprises: (a) an encoder/decoder; (b) a DNA writing device; and (c) a DNA reading device.

Encoder/Decoder: In various aspects, the encoder/decoder comprises an algorithm with two functions: the encoder portion translates given digital/binary information into a specific set of DNA sequences that are inputs to the DNA writer. The decoder portion translates a given set of DNA sequences of the type provided by the DNA reader, back into digital information.

DNA writing Device: In various aspects, the DNA writing device comprises any device that takes a given set of DNA sequences and synthesizes DNA molecules from these sequences (see, e.g., Kosuri and Church, “Large Scale de novo DNA synthesis: technologies and applications,” Nature Methods, 11: 499-509, 2014). Non-limiting examples of methods and devices for synthesizing DNA molecules include commercial technology offered by Agilent Technologies and Twist BioScience. For each desired sequence, multiple DNA molecules representing that sequence are produced. The multiplicity of molecules produced can be in the ranges of 10's, 100's, 1000's, millions or even billions of copies of DNA molecules for each desired sequence. All of these copies representing all the desired sequences may be pooled into one master pool of molecules. It is typical of such DNA writing systems that the writing is not perfect, and if N molecules are synthesized to represent a given input sequence, not all of these will actually realize the desired sequence. For example, they may contain erroneous deletions, insertions, or incorrect or physically damaged bases.

DNA reading Device: In various aspects, the DNA reading device is a device that takes a pool of DNA molecules and produces a set of measured DNA sequences for molecules sampled or selected from this pool. Such readers actually survey only a small portion of the DNA molecules introduced into the system, so that only a small fraction will undergo an actual read attempt. It is further typical of such DNA reading devices that a given DNA molecule that is processed may not be read with entire accuracy, and thus there may be errors present in the read. As a result, it is also typical that the measured sequence outputs include various forms of confidence estimates and missing data indicators. For example, for each letter in a measured sequence, there may be a confidence probability or odds that it is correct, versus the other three DNA letter options, and there may be missing data indicators that indicate the identity of a letter is unknown, or there may be a set of optional sequence candidates with different probabilities representing a portion of a read.

The three major elements of a DNA data storage system in accordance with the present disclosure, as set forth above, have certain roles and interrelations, as detailed further below.

The Relation Between Major Elements, and the Central Role of the Encoder/Decoder:

The information encoder/decoder is selected based on the properties of the DNA writer and DNA reader devices, so as to minimize or reduce some overall measure of the cost of the information storage/retrieval process. One key component of system cost is the overall error rate in retrieved information. Errors and costs are diagrammatically illustrated in FIGS. 3-9.

In general, a DNA writer device can introduce writing errors, and a DNA reading device can produce reading errors, and so the processes of storing information in the system and then later retrieving it potentially result in an error rate seen in the retrieved information. As diagrammatically illustrated in FIG. 3, there is a need to compensate for errors in the system. FIG. 3 illustrates the need for error correction, error avoidance, or some form of error compensation since encoding information followed by the decoding of the information will otherwise generally not result in the original information returned, primarily due to physicochemical errors in the reading and writing of DNA. The encoder/decoder algorithm can be chosen to minimize or reduce this error rate, based on the error properties and propensities of the DNA reader and DNA writer. These embodiments are illustrated in FIGS. 3-6.

FIG. 4 illustrates an embodiment of ideal error compensation. Ideal error compensation is an error compensation scheme that is aware of error modes of the synthesizer and sequencer technologies. Errors are reduced and/or compensated through a combination of avoiding error generation, and detection and correction of errors using knowledge of the error modes of both the DNA reading and DNA writing systems, and also based on the observed data uncertainty, as reflected in empirical quality scores for the written and read DNA sequences, as generated by the reading and writing systems.

FIG. 5 illustrates cost-conscious management of the errors in a DNA information storage system. In general, certain DNA sequences can have inherently greater cost, based on a cost function related to, for example, error rate, time required, reagent consumption, financial cost, etc. In various aspects of the present disclosure, costs associated with the reading and writing technologies are considered, and are reduced or optimized by using an encoding/decoding scheme that minimizes costs in general (e.g., by avoiding high cost, error prone, slow synthesis or slow to read DNA motifs).

In various embodiments, nucleotides can be preferentially selected for incorporation in nucleotide sequences based on their ease of synthesis in the writing process that forms molecules, reduced propensity to form secondary structure in the synthesized molecules, and/or ease in reading during the data decoding process. In various aspects, bad writing motifs and bad reading motifs are avoided in the selection of nucleotides for incorporation into nucleotide sequences, with a focus on incorporating segments in the nucleotide sequence that will produce mutually distinguishable signals when that nucleotide sequence is read to decode the encoded information. For example, in reading a nucleotide sequence, A and T are mutually distinguishable, C and G are mutually distinguishable, A, C and G are mutually distinguishable, AAA and TT are mutually distinguishable, A, GG and ATA are mutually distinguishable, and C, G, AAA, TTTT, GTGTG are mutually distinguishable. These, and many other sets of nucleotides and nucleotide segments provide mutually distinguishable signals in a reader, and thus can be considered for incorporation in a nucleotide sequence when encoding a set of information into a nucleotide sequence.

Additionally, there are nucleotide segments that are difficult to write, and thus should be avoided when encoding a set of information into a nucleotide sequence. In various embodiments, encoding of a set of information into a nucleotide sequence comprises the use of one of the remaining distinguishable feature sets as the encoding symbols, such as may correspond to binary 0/1, trinary 0/1/2 or quad 0/1/2/3 code, etc., along with an error correcting encoding to define the set of information in a way that avoids the hard to read and hard to write features. In this way, overall performance of an information storage system is improved.
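As a minimal sketch of using a distinguishable feature set as the encoding symbols, the following example takes the pair of segments AAA and TT, which the text lists as mutually distinguishable, as a binary alphabet. The assignment 0→AAA and 1→TT and the function names are assumptions for illustration; no error correcting layer is included.

```python
# Assumed symbol assignment; the text only states that AAA and TT are distinguishable.
SYMBOLS = {"0": "AAA", "1": "TT"}
SEGMENT_TO_BIT = {segment: bit for bit, segment in SYMBOLS.items()}

def encode(bits: str) -> str:
    """Concatenate one DNA segment per input bit."""
    return "".join(SYMBOLS[bit] for bit in bits)

def decode(sequence: str) -> str:
    """Parse the sequence left to right, one segment at a time."""
    bits = []
    position = 0
    while position < len(sequence):
        for segment, bit in SEGMENT_TO_BIT.items():
            if sequence.startswith(segment, position):
                bits.append(bit)
                position += len(segment)
                break
        else:
            raise ValueError(f"unrecognized segment at position {position}")
    return "".join(bits)

dna = encode("0110")
print(dna)          # AAATTTTAAA
print(decode(dna))  # 0110
```

Because neither segment is a prefix of the other, the left-to-right parse is unambiguous at the string level. In an actual reader the decoding would operate on the signal produced by each segment rather than on letters.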

In various embodiments, methods of storing information in a nucleotide sequence and retrieving the stored information in the nucleotide sequence are disclosed. In various aspects, the method comprises (a) a system for synthesizing nucleotide sequences, such as synthesizing DNA molecules, corresponding to a given sequence of bases. As discussed, the given base sequence may be determined through a thoughtful selection of nucleotides and nucleotide segments that encode a given set of information, such as binary information. In various aspects, the method comprises (b) a system for reading signals from a nucleotide sequence, such as from a DNA strand, wherein the nucleotide sequence comprises a collection of distinguishable sequence segments, such that with such a set {X, Y, Z . . . }, each of the sequence segments X, Y, Z . . . occurring within a molecule generates distinguishable signals when processed by a reader. In other examples, the method comprises (c) identifying undesirable nucleotides and nucleotide segments based on their propensity to be written incorrectly in the synthesis process, to be incorporated too slowly in the synthesis process, to generate secondary structure in the synthesized molecule, or to be too costly to use in the synthesis process. In various embodiments, the method comprises (d) identifying undesirable nucleotides and nucleotide segments based on their propensity to be read incorrectly when information is decoded from a molecule, to read too slowly when information is decoded, and so forth. 
In various aspects, the method comprises utilizing a synthesis method comprising encoding a set of information into a DNA molecule, relying on an error detection and/or correcting coding scheme, and using an encoding method wherein one of the feature sets of (b) above is used as the symbol alphabet of the encoding and wherein this feature set is selected to not use any of the undesirable features delineated in (c) and (d) above, and using the reading method of (b) above to retrieve the information previously encoded in, for example, a DNA molecule.

In various embodiments, the distinguishable features in a nucleotide sequence, such as a DNA molecule, may comprise individual bases. The bad reading features may comprise individual specific bases. The bad writing features may also comprise individual specific bases, wherein the encoding scheme corresponds to using an error correcting binary code on the input information string, with binary symbols 0 and 1 converted to x and y to achieve the DNA encoding.

FIG. 6 illustrates an ideal error aware system that comprises an error compensation scheme that is aware of the error modes inherent in the DNA synthesizer and sequencer technologies.

FIG. 7 illustrates an example of 2-way error compensation. The left portion of the figure shows errors created in this specific example, and the right portion of the figure shows avoidance of these errors by an error-compensating encoding/decoding scheme that produces an input sequence that does not have the problematic DNA sequence motifs, i.e., T and C bases. In this illustrated example of FIG. 7, the DNA writer has a propensity to sometimes delete a T, and the DNA reader has a propensity to sometimes read a C as a T. In this example, use of encoding with T and C can result in errors such as an incoming encoded data sequence GATTACA reading out as GATATA (wherein one T was deleted in writing, and a C→T reading error occurred). However, if the ideal encoding of data never had a “T” or “C,” e.g., in the case of encoding all binary 0/1 data simply and directly with a binary DNA code of A=0, G=1, there would be no errors produced in the storing and retrieval of data.
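The GATTACA example can be traced with a small simulation. The sketch below models the worst case of the two stated error modes deterministically (the writer deletes the first T it encounters, and the reader reads every C as a T), which is a simplifying assumption; the function names are hypothetical.

```python
BIT_TO_BASE = {"0": "A", "1": "G"}  # binary DNA code from the text: A=0, G=1
BASE_TO_BIT = {"A": "0", "G": "1"}

def write_with_t_deletion(sequence: str) -> str:
    """Toy writer: deletes the first T, if any (assumed worst case of 'sometimes deletes a T')."""
    return sequence.replace("T", "", 1)

def read_with_c_to_t(sequence: str) -> str:
    """Toy reader: reads every C as a T (assumed worst case of 'sometimes reads a C as a T')."""
    return sequence.replace("C", "T")

def store_and_retrieve(sequence: str) -> str:
    return read_with_c_to_t(write_with_t_deletion(sequence))

def encode_bits(bits: str) -> str:
    return "".join(BIT_TO_BASE[bit] for bit in bits)

def decode_bases(sequence: str) -> str:
    return "".join(BASE_TO_BIT[base] for base in sequence)

print(store_and_retrieve("GATTACA"))                    # GATATA
print(decode_bases(store_and_retrieve(encode_bits("0110"))))  # 0110
```

A sequence written only with A and G passes through both error modes unchanged, which is the error avoidance shown in the right portion of FIG. 7.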

Thus, in general, in order to reduce errors, the digital data encoding/decoding algorithm can comprise error detecting and error correcting codes selected to minimize error production, given the actual error modes of the DNA writer and DNA reader. These codes can be devised with the benefit of prior knowledge of the error modes, i.e., the propensity for particular errors of the writer and reader.

In various embodiments, the error correcting codes reside within a single nucleotide sequence. For example, one segment of binary data is encoded in one DNA sequence, with the use of error correction and/or detection schemes on the DNA side. Such schemes may also involve encoding one segment of binary data into multiple DNA sequences, to provide another level of redundant encoding of information, which is analogous to error correction through redundant storage. Error detection schemes include, but are not limited to, repetition code, parity bits, checksums, cyclic redundancy checks, cryptographic hash functions, and error correcting codes such as Hamming codes. Error correction schemes include, but are not limited to, automatic repeat request, error correcting code such as convolutional codes and block codes, hybrid automatic repeat request, and Reed-Solomon codes.

In various embodiments, a method of devising an optimal or highly efficient error correcting encoding, wherein the incoming digital data is considered as binary words of length N, comprises the steps of: providing a space of all DNA words of length M, such that there are many more possible DNA words than binary words (i.e., 4^(M)>>2^(N)); and selecting a subset of 2^(N) of the DNA words to use as code words for encoding the 2^(N) binary information words, such that when each of these DNA code words is expanded into the set of probable DNA writing errors for the given word, and then that set further expanded by the set of probable reading error words, these resulting 2^(N) sets of DNA words remain disjoint with high probability. In such a case, any word read by the reader can be properly associated back to the ideal encoded DNA word with very high probability. This method constitutes a combination of error correcting and error avoiding encoding of information. In addition, the decoding algorithm would also naturally make use of confidence or odds information supplied by the reader, to select the maximum likelihood/highest confidence decoding relative to the encoding scheme.

Another key aspect of optimizing the overall DNA data storage system costs is the time required to write data. For example, the critical time cost in many embodiments may be the time cost of writing the data. In various embodiments, the writing of certain slow-to-synthesize bases and sequence motifs is avoided in order to shorten the overall writing time. In other aspects, the writing is faster, such as by reducing the time spent on each chemistry cycle of some cyclical process that writes one base in many parallel synthesis reactions, with acceptance of a higher overall writing error rate.

Similarly, for reading, a faster reading process may be employed, with the trade-off being a higher rate of reading errors. In various examples, a faster reading process is employed without an increase in error by avoiding the introduction of certain types of sequences in the encoding that are difficult to read at a rapid rate, such as homopolymer runs. In either case, the information encoding/decoding algorithm can be co-optimized with these choices that allow for faster reading/writing but with extra error modes to be avoided, or avoiding slow-to-read/write sequence motifs, handled within the encoding/decoding.

These embodiments of cost optimization are illustrated in FIGS. 8 and 9. FIG. 8 illustrates an example of 2-way cost compensation for general cost reduction/optimization. In this example, there exists both a high cost synthesis, T, and an error mode whereby reading C goes to T. In the embodiment of FIG. 8, error compensation for a generalized cost-optimizing result, reducing both synthesis costs and errors, comprises the use of encoding that avoids the costly DNA motifs T and C. FIG. 9 illustrates exemplary factors of a cost optimized encoding system. Factored in are the financial cost, speed and error rates of the DNA writer and reader when co-optimizing the encoding/decoding algorithm and the performance parameter selection of the DNA reading and writing systems. The performance parameters of the writer and reader depend on the DNA sequence as well as other tunable/selectable parameters of those systems, and these parameters, as well as algorithm selection and parameters, are co-optimized to reduce or minimize these costs.

In general, there exist a variety of factors in an overall measure of “cost” of the information storage/retrieval process, including error rates, speed, financial costs of reagents or components, robustness of the system or time between failure, etc. These properties of the reader and writer are furthermore generally variable, depending on the operating parameters (e.g., time allowed for some reaction to complete, purity of chemical reagents used, operating temperature, etc.).

In various embodiments, the choice of, and control parameter settings of, the encoder/decoder algorithm, and of the writer and reader systems, are co-selected and/or co-optimized, to reduce or minimize some global cost function or collection of cost functions (see FIG. 9). In this way, the “cost” of operating the system can be greatly reduced, through avoidance or mitigation of the higher cost operating scenarios.
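One way to picture such co-selection is as a search over candidate system configurations scored by a single global cost function. The sketch below is purely illustrative: the weighted-sum form of the cost, the weights, the configuration names and every numeric value are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """One hypothetical joint choice of encoding scheme and writer/reader settings."""
    name: str
    error_rate: float  # fraction of retrieved bits in error
    hours: float       # combined write and read time
    dollars: float     # reagent and instrument cost

def global_cost(config: Config, w_error: float = 1000.0,
                w_time: float = 1.0, w_money: float = 0.1) -> float:
    """Assumed global cost: a weighted sum of error rate, time and financial cost."""
    return (w_error * config.error_rate
            + w_time * config.hours
            + w_money * config.dollars)

# Hypothetical candidates; all numbers are invented for illustration.
candidates = [
    Config("4-base code, fast cycles", error_rate=0.05, hours=2.0, dollars=100.0),
    Config("4-base code, slow cycles", error_rate=0.01, hours=10.0, dollars=100.0),
    Config("A/G-only code, fast cycles", error_rate=0.005, hours=4.0, dollars=200.0),
]

best = min(candidates, key=global_cost)
print(best.name, global_cost(best))
```

In practice the candidate set would span continuous operating parameters as well, and a collection of cost functions could be handled with a multi-objective search instead of a single weighted sum.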

Optimization of the DNA Reading Device

In various embodiments of the DNA information storage system herein, the DNA reading device comprises a massively parallel DNA sequencing device, which is capable of a high speed of reading bases from each specific DNA molecule such that the overall rate of reading stored DNA information can be fast enough, and at high enough volume, for practical use in large scale archival information retrieval. The rate of reading bases sets a minimum time on data retrieval, related to the length of stored DNA molecules.

In various embodiments, a molecular electronics sensor extracts information from single DNA molecules, in a way that provides a reader for digital data stored as DNA. FIG. 10A illustrates the basic concept of a molecular electronic sensing circuit in which a sensor molecule complex completes an electrical circuit, and an electrical circuit parameter, such as current “i,” is measured versus time (“t”) to provide a signal, wherein variations in signal correspond to interactions of the sensor molecule complex with interacting molecules in the environment of the sensor. As illustrated in FIG. 10A, a molecular electronics sensor comprises a circuit in which a single molecule, or a complex of a small number of molecules, forms a completed electrical circuit spanning the gap between a pair of nano-scale electrodes, and an electronic parameter is modulated by this single molecule or complex, and in which this parameter is measured as a signal to indicate (“sense”) the single molecule or complex interacting with target molecules in the environment. In various embodiments, e.g., as indicated in FIG. 10A, the measured parameter is current passing through the electrodes, versus time, and the molecular complex is conjugated in place with specific attachment points to the electrodes.

FIG. 10B illustrates an embodiment of a polymerase-based molecular electronics sensor for use herein as a DNA reader device. A sensor comprising a polymerase, such as illustrated in FIG. 10B, produces distinguishable signals from distinct DNA molecular features (abbreviated in the figure as “Feat. 1,” “Feat. 2,” and so forth). Such features can be used to encode information into synthetic DNA molecules, which can in turn be read via the sensor.

In various embodiments, the molecular complex of an individual sensor circuit comprises a single polymerase enzyme molecule that engages with a target DNA molecule to produce electrical signals as it processes the DNA template. Under appropriate conditions, such a polymerase will produce distinguishable electrical signal features, corresponding to specific distinct features of a template DNA molecule, such as illustrated in FIG. 11 by two different peak shapes in the signal trace. Such distinguishable signal features can therefore be used to encode information in synthetic DNA molecules, through a great variety of encoding schemes, such as those of FIG. 29, discussed below, and therefore such a sensor provides the reader for data so encoded.

FIG. 11 illustrates an embodiment wherein a polymerase molecule is conjugated to a bridge molecule that completes the circuit between the two electrodes in a pair of electrodes. Current between the electrodes is the measured electrical parameter. When the polymerase engages a proper template, such as a primed, single-stranded DNA molecule, in the presence of suitable buffer solution and dNTPs, the activity of the polymerase in synthesizing a complementary strand causes perturbations in the measured signals related to the detailed kinetics of the enzyme activity. In this case, the plot of current through the electrodes versus time provides a signal with distinguishable features (such as amplitude variations) corresponding to structural features of the DNA molecule being processed.

FIG. 12 illustrates an embodiment of a sensor circuit wherein the polymerase is wired to the electrodes in a pair of electrodes using two “arm” molecules, so as to make the polymerase an essential part of the current path.

FIG. 13 shows another embodiment of a sensor circuit, where there are no arms, and the polymerase is conjugated directly to the two electrodes. In various embodiments, the molecular complex conjugated to the polymerase, and conjugated to the electrodes, is formed via a series of one or more molecular self-assembly processes, driven by the highly specific and efficient chemistry of various conjugation groups and conjugation reactions.

FIG. 14 shows 3D representations of the molecular structure of one specific polymerase, namely the Klenow (or Large) Fragment of the E. coli Polymerase I.

FIGS. 15 and 16 illustrate embodiments of a molecular sensor circuit wherein a Klenow (or Large) Fragment of the E. coli Polymerase I is conjugated to an abstract molecular bridge molecule (indicated by the bolded bar between electrodes), or conjugated directly into the circuit through use of two abstract arm molecules, respectively. FIG. 17 illustrates an embodiment of a molecular electronic sensor where the Klenow Fragment of E. coli Polymerase I is conjugated directly into the current path and directly to the metal electrodes, without the use of arm or bridge molecules.

FIG. 18 illustrates an embodiment of a working sensor 200 in detail, used herein as a DNA reader device for a DNA data information system. Molecular sensor structure 200 comprises two electrodes 201 and 202, comprising titanium or chromium. Electrodes 201 and 202 may comprise the source and drain electrodes in a circuit. The electrodes 201 and 202 are separated by a nanogap of about 10 nm. Other gap distances may be required to accommodate other lengths of biomolecular bridges. In this example, the bridge molecule 203 comprises a double-stranded DNA oligomer molecule of about 20 nm in length (e.g., 60 bases; 6 helical turns), with thiol groups 204 and 205 at both the 3′ and 5′ ends for coupling of the bridge molecule 203 to gold contacts 206 and 207 provided on each metal electrode 201 and 202. The bonds between the ends of the DNA oligomer and the gold contact points comprise sulfur-gold bonds, available from thiol groups on 5′ ends of the DNA bridge molecule binding to the gold. The probe molecule in this sensor comprises Klenow Fragment of E. coli Polymerase I molecule 210, chemically crosslinked at covalent linkage 211 to a Streptavidin protein 212, using a biotinylated site on the polymerase, which in turn is coupled to a binding site 214 via a biotinylated nucleotide in the synthetic DNA oligo 203. In operation, the sensor 200 further comprises a DNA strand 220 being processed by the polymerase 210. The figure approximates the relative sizes of the molecules and atoms.

In various embodiments of a molecular electronics sensor for use herein, the polymerase may be a native or mutant form of Klenow, Taq, Bst, Phi29 or T7, or may be a reverse transcriptase. In various embodiments, the mutated polymerase forms will enable site specific conjugation of the polymerase to the bridge molecule, arm molecule or electrodes, through introduction of specific conjugation sites in the polymerase. Such conjugation sites engineered into the protein by recombinant methods or methods of synthetic biology may, in various embodiments, comprise a cysteine, an aldehyde tag site (e.g., the peptide motif CxPxR), a tetracysteine motif (e.g., the peptide motif CCPGCC) (SEQ ID NO: 1), or an unnatural or non-standard amino acid (NSAA) site, such as through the use of an expanded genetic code to introduce a p-acetylphenylalanine, or an unnatural cross-linkable amino acid, such as through use of RNA- or DNA-protein cross-link using 5-bromouridine (see, e.g., Gott, J. M., et al., Biochemistry, 30 (25), pp 6290-6295 (1991)). The bridge molecules or arm molecules may, in various embodiments, comprise double stranded DNA, other DNA duplex structures, such as DNA-PNA or DNA-LNA or DNA-RNA duplex hybrids, peptides, protein alpha-helix structures, antibodies or antibody Fab domains, graphene nanoribbons or carbon nanotubes, or any other of a wide array of molecular wires or conducting molecules known to those skilled in the art of molecular electronics. The conjugations of polymerase to such molecules, or of such molecules to the electrodes, may be by a diverse array of conjugation methods known to those skilled in the art of conjugation chemistry, such as biotin-avidin couplings, thiol-gold couplings, cysteine-maleimide couplings, gold or material binding peptides, click chemistry coupling, Spy-SpyCatcher protein interaction coupling, antibody-antigen binding (such as the FLAG peptide tag/anti-FLAG antibody system), and the like. 
Coupling to electrodes may be through material binding peptides, or through the use of a SAM (Self-Assembling-Monolayer) or other surface derivatization on the electrode surface to present suitable functional groups for conjugation, such as azide or amine groups. The electrodes comprise electrically conducting structures, which may comprise any metal, such as gold, silver, platinum, palladium, aluminum, chromium, or titanium, layers of such metals in any combination, such as gold on chromium, or semiconductors, such as doped silicon, or in other embodiments, a contact point of a first material on a support comprising a second material, such that the contact point is a site that directs chemical self-assembly of the molecular complex to the electrode.

In various embodiments, electrical parameters measured in a sensor, such as the sensor illustrated in FIG. 18, can in general be any electrical property of the sensor circuit measurable while the sensor is active. In one embodiment, the parameter is the current passing between the electrodes versus time, either continuously or sampled at discrete times, when a voltage, fixed or varying, is applied between the electrodes. In various embodiments, a gate electrode is capacitively coupled to the molecular structure, such as a buried gate or back gate, which applies a gate voltage, fixed or variable, during the measurement. In various other embodiments the measured parameter may be the resistance, conductance, or impedance between the two electrodes, measured continuously versus time or sampled periodically. In various aspects, the measured parameter comprises the voltage between the electrodes. If there is a gate electrode, the measured parameter can be the gate voltage.

In various embodiments, the measured parameter in a molecular electronics sensor, such as the sensor of FIG. 18, may comprise a capacitance, or the amount of charge or voltage accumulated on a capacitor coupled to the circuit. The measurement can be a voltage spectroscopy measurement, such that the measurement process comprises capturing an I-V or C-V curve. The measurement can be a frequency response measurement. In all such measurements, for all such measured parameters, there are embodiments in which a gate electrode applies a gate voltage, fixed or variable, near the molecular complex during the measurement. Such a gate will typically be physically located within a micron distance, and in various embodiments, within a 200 nm distance of the molecular complex. For the electrical measurements, in some embodiments there will be a reference electrode present, such as an Ag/AgCl reference electrode, or a platinum electrode, in the solution in contact with the sensor, and maintained at an external potential, such as ground, to maintain the solution at a stable or observed potential, and thereby make the electrical measurements better defined or controlled. In addition, when making the electrical parameter measurement, various other electrical parameters may be held fixed at prescribed values, or varied in a prescribed pattern, such as, for example, the source-drain electrode voltage, the gate voltage if there is a gate electrode, or the source-drain current.

The use of a sensor, such as the sensor illustrated in FIG. 18, tomeasure distinguishable features of a DNA molecule requires thepolymerase to be maintained in appropriate physical and chemicalconditions for the polymerase to be active, to process DNA templates,and to produce strong, distinguishable signals above any backgroundnoise (i.e., high signal-to-noise ratio, or “SNR”). To achieve this, thepolymerase may reside in an aqueous buffer solution. In variousembodiments, a buffer solution may comprise any combination of salts,e.g. Nalco or KCl, pH buffers, Tris-HCl, multivalent cation cofactors,Mg, Mn, Ca, Co, Zn, Ni, Fe or Cu, or other ions, surfactants, such asTween, chelating agents such as EDTA, reducing agents such as DTT orTCEP, solvents, such as betaine or DMSO, volume concentrating agents,such as PEG, and any other component typical of the buffers used forpolymerase enzymes in molecular biology applications and known to thoseskilled in the field of molecular biology. The sensor signals may alsobe enhanced by such buffers being maintained in a certain range of pH ortemperature, or at a certain ionic strength. In various embodiments, theionic strength may be selected to obtain a Debye length (electricalcharge screening distance) in the solution favorable for electricalsignal production, which may be, for example, in the range of from about0.3 nm to about 100 nm, and in certain embodiments, in the range of fromabout 1 nm to about 10 nm. Such buffers formulated to have larger Debyelengths may be more dilute or have lower ionic strength by a factor of10, 100, 1000, 100,000 or 1 million relative to the bufferconcentrations routinely used in standard molecular biology proceduressuch as PCR. 
Buffer compositions, concentrations and conditions (pH, temperature, or ionic strength, for example) may also be selected or optimized to alter the enzyme kinetics to favorably increase the signal-to-noise ratio (SNR) of the sensor, the overall rate of signal production, or the overall rate of information production, in the context of reading data stored in DNA molecules. This may include slowing down or speeding up the polymerase activity by these methods, or altering the fidelity or accuracy of the polymerase. This optimal buffer selection process consists of selecting trial conditions from the matrix of all such parameter variations, empirically measuring a figure of merit, such as one related to the discrimination of the distinguishable features, or to the overall speed of feature discrimination when processing a template, and using various search strategies, such as those applied in statistical Design Of Experiment (DOE) methods, to infer optimal parameter combinations.
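By way of non-limiting illustration, the relationship between ionic strength and Debye length discussed above can be sketched numerically. The sketch below uses the standard approximation for an aqueous solution at about 25 °C, in which the Debye length in nanometers is roughly 0.304 divided by the square root of the ionic strength in mol/L. The function name and the 0.1 M reference ionic strength are assumptions chosen for illustration only and are not taken from the present disclosure.

```python
import math


def debye_length_nm(ionic_strength_molar: float) -> float:
    """Approximate Debye length (nm) in water near 25 C: 0.304 / sqrt(I)."""
    if ionic_strength_molar <= 0:
        raise ValueError("ionic strength must be positive")
    return 0.304 / math.sqrt(ionic_strength_molar)


# Hypothetical reference buffer ionic strength (assumed value, for illustration).
REFERENCE_IONIC_STRENGTH_M = 0.1

if __name__ == "__main__":
    # Show how diluting the buffer lengthens the charge screening distance.
    for dilution in (1, 10, 100, 1000):
        ionic = REFERENCE_IONIC_STRENGTH_M / dilution
        print(f"{dilution:>5}x dilution: I = {ionic:.1e} M, "
              f"Debye length = {debye_length_nm(ionic):.2f} nm")
```

Under these assumptions, each 100-fold dilution lengthens the Debye length about 10-fold, which is why substantially diluted buffers may be needed to reach the upper end of the stated range.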

The use of a sensor such as the sensor of FIG. 18 to measure distinguishable features of a DNA molecule requires the polymerase be provided with a supply of dNTPs so that the polymerase can act processively on a template single-stranded DNA molecule to synthesize a complementary strand. The standard or native dNTPs are dATP, dCTP, dGTP, and dTTP, which provide the A, C, G, and T base monomers for polymerization into a DNA strand, in the form required for the enzyme to act on them as substrates. Polymerase enzymes, native or mutant, may also accept analogues of these natural dNTPs, or modified forms, that may enhance or enable the generation of the distinguishable signals.

In various aspects of DNA reading herein, if a system reads a DNA molecule at a speed of 1 base per 10 minutes, as is representative of current next generation, optical dye-labeled terminator sequencers, then reading a 300 base DNA molecule takes at least 3,000 minutes (50 hours), aside from any time required to prepare the sample for reading. Such relatively slower systems therefore favor storing information in a larger number of shorter reads, such as 30 base reads that could be read in 5 hours. However, this requires a larger number of total reads, so the system must support billions or more such reads, as is the case on such sequencers. The current generation of optical massively parallel sequencers read on the order of 3 billion letters of DNA per 6-minute cycle, or roughly the equivalent of 1 billion bits per minute, or 2 MB per second, although for data stored as 50 base DNA words, this would also require 300 minutes (5 hours). This can be seen to be a relatively low rate of data reading, although within a practical realm, as a typical book may contain 1 MB of textual data. The overall rate is practical, but the slow per base time makes this highly inefficient for reading a single book of data, and ideally matched to bulk reading of 36,000 books in parallel, over 5 hours. Thus, there is also a lack of scalability in this current capability, and also a high capital cost of the reading device (optical DNA sequencers cost in the $100,000 to $1,000,000 range presently). More critically, on such current systems, the cost of sequencing a human genome worth of DNA, 100 billion bases, is roughly $1,000, which means the cost of reading information is $1,000 per 200 Giga-bits, or $40 per GB. This is radically higher than the cost of reading information from magnetic tape storage or CDs, which is on the order of $1 per 10,000 GB, or $0.0001 per GB, 400,000 fold less costly.
Thus the cost of reading DNA should be reduced by several orders of magnitude, even by 1,000,000 fold, to make this attractive for large scale, long term archival storage, not considering other advantages. Such improvements may indeed be possible, as evidenced by the million-fold reduction in costs of sequencing that has already occurred since the first commercial sequencers were produced.
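The reading time, throughput, and cost figures above follow from simple arithmetic, which may be reproduced with the short sketch below. The sketch assumes 2 bits per DNA letter and 8 bits per byte, and the function names are hypothetical names introduced only for this illustration.

```python
BITS_PER_BASE = 2   # four letters (A, C, G, T) carry 2 bits each
BITS_PER_BYTE = 8


def read_time_minutes(bases: int, minutes_per_base: float) -> float:
    """Time to read one molecule serially, one base at a time."""
    return bases * minutes_per_base


def throughput_mb_per_second(letters_per_cycle: float,
                             minutes_per_cycle: float) -> float:
    """Aggregate data rate of a massively parallel reader, in MB per second."""
    bytes_per_cycle = letters_per_cycle * BITS_PER_BASE / BITS_PER_BYTE
    return bytes_per_cycle / 1e6 / (minutes_per_cycle * 60)


def cost_per_gb(total_cost_usd: float, bases: float) -> float:
    """Reading cost per gigabyte of stored data."""
    gigabytes = bases * BITS_PER_BASE / BITS_PER_BYTE / 1e9
    return total_cost_usd / gigabytes


if __name__ == "__main__":
    print(read_time_minutes(300, 10))           # 300-base read at 10 min per base
    print(read_time_minutes(30, 10))            # 30-base read at 10 min per base
    print(throughput_mb_per_second(3e9, 6))     # about 2 MB per second
    print(2 * 5 * 3600)                         # 1 MB books read in 5 h at 2 MB/s
    dna = cost_per_gb(1000, 100e9)              # dollars per GB for DNA reading
    tape = 1 / 10_000                           # dollars per GB for tape or CD
    print(dna, round(dna / tape))               # cost and fold difference
```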

In various embodiments, the DNA reader of the present system comprises substantially lower instrument capital costs, higher per-base reading speed, and greater scalability in total number of reads per run, compared to currently available optical next generation sequencing instruments. In various aspects, the reading device for use herein is based on a CMOS chip sensor array device in order to increase the speed and scalability and decrease the capital costs. An embodiment of such a device comprises a CMOS sensor array device, wherein each sensor pixel contains a molecular electronic sensor capable of reading a single molecule of DNA without any molecular amplification or copying, such as PCR, required. In various embodiments, the CMOS chip comprises a scalable pixel array, with each pixel containing a molecular electronic sensor, and such a sensor comprising a bridge molecule and polymerase enzyme, configured so as to produce sequence-related modulations of the electrical current (or related electrical parameters such as voltage, conductance, etc.) as the enzyme processes the DNA template molecule.

An exemplary molecular sensor and chip combination usable as a DNA reader device in the present DNA data storage system is depicted in FIGS. 18, 19, 30A, 30B, and 31-33. As discussed, FIG. 18 illustrates an exemplary molecular sensor comprising a bridge and probe molecule structure further comprising a bridge of double stranded DNA having about a 20 nm length (~60 bases), with thiol groups at both 5′ ends for coupling to gold contacts on a metal electrode. The embodiment of FIG. 18 comprises a polymerase enzyme coupled to a molecular wire comprised of DNA, which plugs into a nano-electrode pair to form a sensor capable of producing sequence-related signals as the polymerase enzyme processes a primed DNA template.

As illustrated in FIG. 19, such a nano-sensor can be placed by post-processing onto the pixels of a CMOS sensor pixel array, which further comprises all the supporting measurement, readout and control circuitry needed to produce these signals from a large number of sensors operating in parallel. FIG. 19 illustrates an embodiment of various electrical components and connections in molecular sensors. In the upper portion of the figure, a cross-section of an electrode-substrate structure 300 is illustrated, with attachment to an analyzer 301 for applying voltages and measuring currents through the bridge molecule of the sensor. In the lower portion of the figure, a perspective view of electrode array 302 is illustrated, usable for bridging circuits. Each pair of electrodes comprises a first metal (e.g., “Metal-1”), and a contact dot or island of a second metal (e.g., “Metal-2”) at each electrode end near the gap separating the electrodes. In various examples, Metal-1 and Metal-2 may comprise the same metal or different metals. In other aspects, the contact dots are gold (Au) islands atop metal electrodes comprising a different metal. In various experiments, contact dots comprise gold (Au) beads or gold (Au)-coated electrode tips that support self-assembly of a single bridge molecule over each gap between electrode pairs, such as via thiol-gold binding.

FIGS. 20A, 20B and 20C show electron micrograph (EM) images of electrodes comprising gold metal dot contacts for bridge binding in DNA sensors. In this example, electrodes are on a silicon substrate, and were produced via e-beam lithography. FIG. 20A shows an array of titanium electrodes with gold dot contacts. In FIG. 20B, a close-up EM shows an electrode gap of about 7 nm with gold dot contacts and with about a 15 nm gold-to-gold spacing. In FIG. 20C, a close-up EM shows gold dots of approximately 10 nm in size positioned at the tips of the electrodes.

FIG. 21 sets forth current versus time plots obtained by measuring DNA incorporation signals with the sensor of FIG. 18. The plots show the current signals resulting from the sensor being supplied with various primed, single stranded DNA sequencing templates and dNTPs for incorporation and polymerization. In each case, the major signal spikes represent signals from discrete incorporation events, wherein the polymerase enzyme adds another base to the extending strand. At the upper left of FIG. 21, the template is 20 T bases; at the upper right, the template is 20 G bases; at the lower left, the template is 20 A bases; and at the lower right, the template is 20 C bases. The approximate rate of incorporation observed is about 10-20 bases per second, consistent with standard enzyme kinetics except for the lower rate of ~1 base per second due to rate limiting factors (e.g., lower dNTP concentration).

FIG. 22 illustrates the principle of using modified forms of dNTPs to produce distinguishable signals, showing an example wherein all 4 dNTPs carry distinct modifications that result in 4 distinguishable signals from the four bases of the template DNA. Many such modified forms are well known to those skilled in the field of nucleic acid biochemistry, and all such forms may be enabling for signal production in various embodiments. This includes dNTPs that have modification to the base, the sugar, or the phosphate group. For example, common modified forms of dNTPs include deaza-, thio-, bromo- and iodo-modifications at various sites on the molecule, or the inclusion of metal ions or different isotopes at various sites, the inclusion of diverse dye molecules at various sites, or methylation of various sites, or biotinylation of various sites. Various modifications include forms that have an extended phosphate chain, beyond the native tri-phosphate, to lengths such as tetra-, penta-, hexa-, hepta- or more (4 or more, up to 11 or more) phosphates. Other examples of modification comprise a chemical group added to the terminal phosphate of the phosphate chain, or any of the phosphates which are cleaved off during incorporation (all but the alpha-phosphate or first in the chain). Polymerases are highly tolerant of such groups, and retain a high level of activity in their presence. Thus, such groups provide a great capacity for modified dNTPs that aid in forming distinguishable signals. In various embodiments, such groups may have different charge states, or different sizes, or different degrees of hydrophobicity, which may aid in producing different signals, or such added groups may interact selectively with the sites on the bridge molecule or on the polymerase or the template DNA to produce distinguishable signals. FIG. 22 illustrates the addition of such groups onto the phosphate chain, to produce distinguishable signals of incorporation for the four bases of the template.

FIG. 23 illustrates the use of two distinct modified dNTPs, a modified dATP, indicated as A*, and a modified dCTP, indicated as C*, to provide two distinguishable signals resulting from their incorporation against the respective complementary standard bases T and G of the template DNA. The use of the two modified dNTPs provides a method to encode binary bits 0/1 into the template DNA. Thus the T and G features of the DNA template produce distinguishable signals, and can be used to encode information readable by this sensor.

FIG. 24 illustrates the use of two different sequence motifs, here homopolymers AA and CCC, to produce two distinguishable signals, which provides a means to encode binary bits 0/1 into the template DNA. In this case, AA and CCC provide two distinguishable sequence motifs that can be used for information encoding and recovery. Thus, useful information encoding and reading is possible even without single base resolution of DNA sequences, by instead relying on distinguishable sequence motifs.

FIG. 25 shows experimental data obtained from the sensor of FIG. 18 in which specific sequence motifs produced signals that are usable to encode 0/1 binary data. The sensor of FIG. 18 comprises the Klenow polymerase conjugated to a DNA bridge, which produces distinguishable signals from the encoding DNA sequence motifs 20A, 3C and 30A in the experimental template DNA. Such signals were produced by using the sensor of FIG. 18 in conjunction with a standard 1× Klenow buffer and a relatively high concentration of dTTP (10 μM), and a 100 times lower concentration of the other dNTPs. The lower concentration of the other dNTPs, notably the low dGTP concentration, facilitates the distinguishable signal from the CCC region via the concentration-limited rate of incorporation. The result is that the poly-A tract has a high spike signal feature, and the poly-C tract has a low trough signal feature, which are readily distinguishable. The peaks and troughs are usable to encode 0/1 binary data in the simple manner illustrated, with 0 encoded by the poly-A tract and read from the high peak signals having several seconds duration, and 1 encoded by the CCC tract and read from the low trough features having several seconds duration.
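As a purely hypothetical sketch of how such peak and trough features might be converted into bits, the example below thresholds a sampled signal and treats each sufficiently long run of high samples as a 0 (poly-A peak) and each sufficiently long run of low samples as a 1 (CCC trough). The function names, thresholds, minimum run length, and the sample trace in arbitrary units are all assumptions for illustration and do not represent the measured data of FIG. 25 or the signal processing of any particular embodiment.

```python
from itertools import groupby


def classify(sample: float, high: float, low: float):
    """Label one sample: '0' for a peak, '1' for a trough, None for baseline."""
    if sample >= high:
        return "0"   # high peak, read as the poly-A tract (bit 0)
    if sample <= low:
        return "1"   # low trough, read as the CCC tract (bit 1)
    return None      # baseline level, carries no bit


def decode_trace(samples, high: float, low: float, min_run: int) -> str:
    """Emit one bit per run of consecutive peak or trough samples."""
    bits = []
    for label, run in groupby(samples, key=lambda s: classify(s, high, low)):
        # Short runs are treated as noise and ignored.
        if label is not None and len(list(run)) >= min_run:
            bits.append(label)
    return "".join(bits)


if __name__ == "__main__":
    # Made-up trace in arbitrary units: peak, trough, noise spike, peak.
    trace = [5, 5, 9, 9, 9, 5, 1, 1, 1, 5, 9, 5, 9, 9, 9, 5]
    print(decode_trace(trace, high=8, low=2, min_run=3))
```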

FIG. 26A illustrates an embodiment of binary encoding wherein two different sequence motifs, GATT and ACA, produce two distinguishable signals that provide a method to encode binary bits 0/1 into the template DNA.

FIG. 26B illustrates another embodiment of encoding wherein three different sequence motifs, GATT, ACA and AGG, produce three distinguishable signals that provide a method to encode digital data with a three-state encoding.

The use of the sensors of the present disclosure to measure distinguishable features of a DNA molecule requires the polymerase be provided with primed, single-stranded template DNA molecules as a substrate for polymerization of a complementary strand, in the course of generating the associated signals. In the context of encoding information in synthetic DNA molecules, these template molecules may be wholly chemically synthetic, and can therefore be provided with chemical or structural modifications or properties beyond those of native DNA, which may be used to enable or enhance the production of distinguishable signals for various embodiments. The polymerase, native or an engineered mutant, can accept as a substrate a great many such modified or analogue forms of DNA, many of which are well known to those skilled in the field of molecular biology. Such modifications to the template DNA can be used to create features with distinguishable signals. FIG. 26C shows a case where the template DNA is synthesized from two base analogues, X and Y, analogs of A and C. The dNTPs provided are the standard dATP and dCTP, which produce enhanced, distinguishable signals when incorporated against the modified bases X and Y in the template. Thus the DNA template synthesized using X and Y analogs can be used to encode information that can be read with this sensor. FIG. 26D shows a further extension of this, wherein the encoding DNA is synthesized with 8 different bases and base analogues, namely the four standard bases A, C, G, T, and the modified base analogs of these, X, Y, Z, W, respectively, to produce 8 distinguishable signals when the standard dNTPs incorporate against them in complementary fashion. Thus, template DNA synthesized from these 8 bases and analogs, A, C, G, T, X, Y, Z, W, provides 8 distinguishable signals, and thus can be used for 8 state encoding (such as shown in FIG. 29, scheme BES5).

In various embodiments, the DNA supplied to the polymerase as a template comprises some form of primed (double stranded/single stranded transition) site to act as an initiation site for the polymerase. For the purpose of storing digital data in DNA, in various embodiments, this priming will be pre-assembled into the encoding molecule, so that no further sample preparation is needed to prime the DNA template molecules. FIGS. 27A and 27B show embodiments wherein DNA data storage molecules have a universal priming structure, pre-assembled.

FIG. 27A sets forth four embodiments of primer configuration usable for storage templates. These include, in descending order in the illustration: (1) primed strand, with oligo primer hybridized to the template; (2) primer cross-linked to the strand for stability; (3) hairpin primer, with a hairpin bend of DNA or other linker molecule, such as PEG polymer linker; and (4) hairpin primer cross-linked in place. FIG. 27B sets forth embodiments of strand architecture that enable the polymerase-based sensor to interrogate the same data payload multiple times. Further considerations of these four embodiments are detailed herein.

In various embodiments, primer constructs comprise any of:

(1) a pre-hybridized universal primer oligo, e.g., of native DNA, optionally having a high melting point or high GC content, or a more stably hybridizing form such as PNA or LNA;

(2) a primer oligo modified with additional cross-linked bases (e.g., bromodeoxyuridine), covalently bound or otherwise strongly chemically coupled in place, so that there is greatly reduced chance of the primer not being in place;

(3) a hairpin primer as part of the DNA template, so that the molecule is preferentially self-priming, wherein the hairpin primer may either be composed entirely of DNA, or a hairpin loop (which in various examples is DNA, or an alternative flexible molecule such as a PEG polymer strand or multi-carbon linker, such as a C3, C6, or longer linker) that allows a hairpin bend, attached to a hybridizing oligo portion, which can be DNA, e.g., having a high melting point or high GC content, or a more stably hybridizing analog such as PNA or LNA;

(4) a hairpin primer, wherein the hybridizing oligo is modified with additional cross-linked bases, covalently bound or otherwise strongly chemically coupled in place, so that there is greatly reduced chance of the primer not being in place. In various embodiments, halogenated thiopyrimidines and bromodeoxyuridines (e.g., 5-bromo-2′-deoxyuridine as substitute for thymidine) are photoreactive halogenated bases that can be incorporated into oligonucleotides to crosslink them to DNA, RNA or proteins with exposure to UV light. In various examples, crosslinking is maximally efficient with light at a wavelength of 308 nm. See, e.g., Cleaver, J. E., Biophys. J., 8, 775-91 (1968); Zeng, Y., et al., Nucleic Acids Res., 34(22), 6521-29 (2006); and Brem, H., et al., J. of Photochemistry and Photobiology B: Biology, 145, 216-223 (2015).

Since the secondary structure in a DNA template can interfere with the processive action of a polymerase, it may be advantageous to reduce, avoid or eliminate secondary structure in the DNA data encoding template molecules used in DNA data reader sensors. Many methods to reduce secondary structure interference are known to those skilled in the field of molecular biology. Methods to reduce, avoid or eliminate secondary structure include, but are not limited to: using polymerases that possess strong secondary structure displacing capabilities, such as Phi29 or Bst or T7, either native or mutant forms of these; adding to the buffer solvents such as betaine, DMSO, ethylene glycol or 1,2-propanediol; decreasing the salt concentration of the buffer; increasing the temperature of the solution; and adding single strand binding protein or degenerate binding oligos to hybridize along the single strand. Methods such as these can have the beneficial effect of reducing secondary structure interference with the polymerase processing the encoding DNA and producing proper signals.

Additional methods available to reduce unwanted secondary structure for DNA data reading in accordance with the present disclosure comprise adding properties to DNA molecules produced by synthetic chemistry. For example, in some embodiments of the present disclosure, the data encoding DNA molecule itself can be synthesized from base analogues that reduce secondary structure, such as using deaza-G (7-deaza-2′-deoxyguanosine) in place of G, which weakens G:C base pairing, or by using a locked nucleic acid (LNA) in the strand, which stiffens the backbone to reduce secondary structure. A variety of such analogues with such effects are known to those skilled in the field of nucleic acid chemistry.

Further methods are available in the present disclosure to reduce unwanted secondary structure for the DNA data reading sensor, because the DNA data encoding scheme determines the template sequence, and thus there is potential to choose the encoding scheme to avoid sequences prone to secondary structure. Such Secondary Structure Avoiding (“SSA”) encoding schemes are therefore a beneficial aspect of the present disclosure. In general, for encoding schemes as described herein, which use distinguishable signal sequence features as the encoding elements, to the extent there are options in the choice of encoding rules (such as exemplified in FIG. 29), all such alternative schemes could be considered, and the schemes that produce less (or the least) secondary structure would be favored for use. The alternative schemes are assessed relative to a specific digital data payload, or statistically across a representative population of such data payloads to be encoded.

For example, the importance of SSA encoding is illustrated in the embodiment where the sensor provides three distinguishable signal sequence features: AAAAA, TTTTT, and CCCCC. If all three features are used in encoding in the same strand (or on other strands), there is a strong potential for the AAAAA and TTTTT encoding elements, being complementary, to hybridize and lead to secondary structure, either within the strand or between DNA strands. Thus, if the data were instead encoded entirely by the scheme wherein 0→AAAAA and 1→CCCCC, i.e., ignoring the use of TTTTT completely, all such potential secondary structure is avoided. Thus, this encoding (or the other SSA choice, 0→TTTTT and 1→CCCCC) is preferred over a scheme that uses self-complementary sequences, even though information density is reduced by giving up one of the three available encoding elements. Thus, in general, SSA codes can be used when there are encoding options and when there is a potential for DNA secondary structure to form. As shown in this embodiment, desirable SSA codes to reduce DNA secondary structure may be less information dense than what is theoretically possible for the distinguishable signal states. However, this tradeoff can result in a net gain of information density, or related overall cost or speed improvements, by avoiding data loss related to DNA secondary structure.
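As a non-limiting illustration of screening candidate encoding elements for an SSA code, the sketch below flags pairs of features that are exact reverse complements of one another, as AAAAA and TTTTT are in the example above. This is a deliberately simplified check under stated assumptions: it detects only full-length complementarity between features and does not model partial hybridization or folding energies. The function names are hypothetical.

```python
from itertools import combinations_with_replacement

# Watson-Crick complement table for the four standard bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def complementary_pairs(features):
    """List feature pairs (including self-pairs) that can fully hybridize."""
    pairs = []
    for a, b in combinations_with_replacement(features, 2):
        if reverse_complement(a) == b:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Using all three features leaves a hybridizing pair in the code.
    print(complementary_pairs(["AAAAA", "TTTTT", "CCCCC"]))
    # Dropping TTTTT gives a code with no complementary elements.
    print(complementary_pairs(["AAAAA", "CCCCC"]))
```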

In various embodiments, methods for reducing secondary structure comprise the use of binding oligos to protect the single strand, wherein the oligos are chosen with sequence or sequence composition that will preferentially bind to the encoding features. Such binding oligos may more effectively protect the single strand than general degenerate oligos. For example, in the case described above with three distinguishable signal sequence features AAAAA, TTTTT, and CCCCC, all three could be used as encoding features, and they could be protected in single stranded form by binding the template to the oligos TTTTT, AAAAA, and GGGGG, or to enhanced binding analogues of these, such as RNA, LNA or PNA forms, instead of DNA. Thus, use of binding oligos that preferentially bind to the encoding features is another means to mitigate unwanted secondary structure effects, although such binding oligos must be used with strand-displacing polymerases, such as native or mutant forms of Klenow, Bst or Phi29, such that the oligos themselves do not interfere. A further method for avoiding secondary structure is to prepare the information encoding DNA in primarily double stranded form, with a nick or gap at the primer site for polymerase initiation, and the rest of the molecule in duplex form, such as is illustrated by the second strand in FIG. 27B (with or without a hairpin bend), so that the DNA molecule exists in solution in a substantially duplex form, free of secondary structure due to single-strand interactions, within or between molecules.

In various embodiments, DNA molecules used to encode information for reading by the cognate molecular sensor can be prepared with an architecture that facilitates the reading process as well as the encoding and decoding processes. Various embodiments of DNA architecture are illustrated in FIG. 28. Illustrated is a representative physical form of a primed single-stranded DNA template (at the top of the drawing), along with the logical forms of an information encoding molecule for use in a digital data storage system. Exemplary forms may include Left and Right Adapters (shown as “L ADAPTOR” and “R ADAPTOR”), to facilitate manipulation of the information coding DNA molecules, a primer (e.g., pre-primed or self-priming, shown as “PRIMER”), left and right buffer segments (shown as “L-BUFFER” and “R-BUFFER”) and a data payload segment (“DATA PAYLOAD”).

With continued reference to FIG. 28, the adapters may comprise, for example, primers for universal amplification processes, used to copy the stored data, or may comprise hybridization capture sites or other selective binding targets, for targeted selection of molecules from a pool. In various embodiments, a primer segment contains primer target/structure, and the L-BUFFER segment may contain a signal calibration sequence for the reader, or a buffering sequence prior to the DATA PAYLOAD segment, which contains information storing encoded sequence and related error correction sequence such as parity bits. In various aspects, the R-BUFFER may contain an additional calibration sequence, as well as a buffer sequence preventing the polymerase enzyme from getting too close to the end of the template when reading data. In various embodiments, the L-ADAPTER and R-ADAPTER may be sequence elements related to the storage or manipulation of the associated DNA segment, such as adapters for outer priming sites for PCR amplification, or hybridization based selection, or representing a surrounding carrier DNA for this insert, including insertion into a host organism genome as a carrier. In various embodiments, the adapters may comprise surrounding or carrier DNA, for example in the case of DNA data molecules stored in live host genomes, such as in bacterial plasmids or other genome components of living organisms.

With further reference to FIG. 28, the L-BUFFER and R-BUFFER segments may comprise DNA segments that support the polymerase binding footprint, or various calibration or initiation sequences used to help interpret the signals coming from the data payload region. These buffer segments may contain molecular barcode sequences that are used to distinguish unique molecules, or to identify replicate molecules that are derived from the same originating single molecule. One such method of barcoding, known to those skilled in DNA oligo synthesis, comprises the addition of a short random N-mer sequence, typically 1 to 20 bases long, made for example by carrying out synthesis steps with degenerate mixtures of bases instead of specific bases.

With continued reference to FIG. 28, DNA logical structures comprise a data payload segment wherein specific data is encoded. In various embodiments, a data payload segment comprises the actual primary digital data being stored along with metadata for the storage method, which may comprise data related to proper assembly of such information fragments into longer strings, and/or data related to error detection and correction, such as parity bits, check sums, or other such information overhead.

In various embodiments, a data payload DNA structure results from a sensor-specific information encoding scheme applied to a source digital data payload, such as binary data, as illustrated in FIG. 29. In this scenario, the originating digital data that is to be stored as DNA will typically have a prior representation as electronic binary data (1/0 bits). In various embodiments of encoding, this originating data will be: (i) divided into segments, (ii) augmented by re-assembly data, and (iii) transformed by error correcting encodings appropriate for DNA data storage to produce the actual binary data payload segments, such as illustrated in the examples of FIG. 29. These actual binary data payload segments then require translation into DNA payload sequences usable in the subsequent synthesis of the DNA physical storage molecules. In various embodiments, the primary translation is performed by Binary Encoding Schemes (“BES”), such as, for example, shown in FIG. 29. These encoding schemes provide primary translation from a digital data format, such as binary, to a DNA molecular sequence format.
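The three preparation steps (i) to (iii) above can be sketched, in a minimal and hypothetical form, as follows. In this sketch the re-assembly data is assumed to be a fixed-width segment index prepended to each segment, and a single even-parity bit stands in for the error detection and correction overhead; an actual embodiment may use far stronger error correcting codes. The function name, field layout, and parameter names are assumptions for illustration only.

```python
def prepare_payloads(bits: str, segment_len: int, index_bits: int):
    """Split a bit string into indexed, parity-protected payload segments."""
    # (i) Divide the originating data into fixed-length segments.
    segments = [bits[i:i + segment_len]
                for i in range(0, len(bits), segment_len)]
    if len(segments) > 2 ** index_bits:
        raise ValueError("index field too narrow for the number of segments")

    payloads = []
    for index, segment in enumerate(segments):
        # (ii) Prepend re-assembly data: the segment index as a binary field.
        body = format(index, f"0{index_bits}b") + segment
        # (iii) Append one even-parity bit (a stand-in for stronger codes).
        parity = str(body.count("1") % 2)
        payloads.append(body + parity)
    return payloads


if __name__ == "__main__":
    for payload in prepare_payloads("0100111000101101",
                                    segment_len=8, index_bits=2):
        print(payload)
```

Each resulting binary payload segment would then be passed to a primary translation step to obtain a DNA payload sequence.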

Which BES is appropriate is directly related to the distinguishable signal feature sets of the sensor, as exemplified in FIG. 11. FIG. 29 illustrates several such primary encodings, using an exemplary binary data payload, namely the 32-bit word shown at the top of the drawing figure. Exemplary binary encoding schemes (BES) shown are:

BES1: a standard encoding of the four possible 2-bit values into the four standard DNA letters (one DNA letter per two binary bits), for use with a reader sensor that can distinguish these features (e.g., FIG. 22);

BES2: the encoding of two binary digits into two bases (one DNA letter per one binary bit), for use with a reader sensor that can distinguish these (such as distinguishing between T and G in FIG. 23);

BES3: encoding two binary digits into two runs of bases, AA and CCC (one run of DNA letters per one binary bit), to encode two binary states for use with a reader sensor that can distinguish these features (such as distinguishing between AA and CCC in FIG. 24);

BES4: using DNA molecules composed of two modified bases, X, Y (one modified base per one binary bit), to encode the two binary states, for use with a reader sensor that can distinguish these modified bases in the template (such as distinguishing X and Y in FIG. 26C);

BES5: using DNA molecules composed of 4 native bases and 4 modified bases, to encode the eight possible 1/0 3-bit states (one DNA base or modified base per 3 bits of data), for use with a sensor that can distinguish between all eight base features (such as distinguishing between A, C, G, T, X, Y, Z, and W in FIG. 26D); and

BES6: using two DNA sequence motifs, to encode two binary states (one DNA sequence motif per one binary bit), for use with a reader sensor that can distinguish the signals of these motifs (such as distinguishing between GATT and ACA in FIG. 26A).

As seen in the examples of FIG. 29, the multi-bit encoding schemes BES1 and BES5 shorten the encoding string in passing from binary to DNA sequence, while the multi-base encoding schemes BES3 and BES6 lengthen the encoding string upon encoding. Code schemes that produce shorter sequences are preferred when reducing the length of synthesized DNA information encoding molecules is a high priority, for example, when there are practical limitations on oligo length for the writing technology. Further as seen in the examples of FIG. 29, the BES2 and BES4 schemes retain the length of the encoding string in passing from binary to DNA sequence. With further reference to FIG. 29, the lower portion of the Figure sets forth the DNA sequences obtained when converting the exemplary binary data payload word (at the top of the drawing figure) with the encoding schemes BES1, BES2, BES3 and BES5.
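The effect of these schemes on string length can be illustrated with a minimal table-driven sketch. The particular assignments of bit patterns to letters or motifs below (for example, 00 to A, or 0 to AA) are assumptions made for illustration, since the exact assignments of FIG. 29 are not reproduced here, and the 8-bit input word is likewise an arbitrary example rather than the payload word of the figure.

```python
# Hypothetical lookup tables; bit-to-feature assignments are assumed.
BES1 = {"00": "A", "01": "C", "10": "G", "11": "T"}   # 2 bits per letter
BES2 = {"0": "T", "1": "G"}                             # 1 bit per letter
BES3 = {"0": "AA", "1": "CCC"}                          # 1 bit per run
BES6 = {"0": "GATT", "1": "ACA"}                        # 1 bit per motif


def encode(bits: str, table: dict) -> str:
    """Translate a bit string into DNA using a fixed-width lookup table."""
    width = len(next(iter(table)))      # number of bits consumed per symbol
    if len(bits) % width:
        raise ValueError("bit string length is not a multiple of the width")
    return "".join(table[bits[i:i + width]]
                   for i in range(0, len(bits), width))


if __name__ == "__main__":
    word = "01001110"   # arbitrary 8-bit example
    for name, table in (("BES1", BES1), ("BES2", BES2),
                        ("BES3", BES3), ("BES6", BES6)):
        dna = encode(word, table)
        print(f"{name}: {dna} ({len(word)} bits -> {len(dna)} bases)")
```

With these assumed tables, the BES1 output is half the length of the binary word, the BES2 output is the same length, and the BES3 and BES6 outputs are longer, consistent with the comparison above.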

Binary encoding schemes for use herein are not limited to the examples set forth in FIG. 29, and many variations or similar encoding schemes to those shown in FIG. 29 are also implicit in these examples, such as by permuting the letters used, or changing the lengths of sequence motifs, the composition of sequence motifs, and/or the choice of modified or analog bases. It is also understood that all such encoding schemes have a cognate sensor that is capable of distinguishing the signals of the encoding features, so that the choices of BES are directly related to the properties of the sensor in distinguishing features. It is also understood that, even though the examples of FIG. 29 exemplify cases with 2, 4 or 8 distinguishable features, for convenience in describing bit encodings of 1, 2 or 3 bits, encodings of binary data can be done based on any number of distinguishable signal features, such as 3 distinguishable features as in the sensor of FIG. 26B.

In various embodiments, information as binary data such as 010011100010 may be encoded using three states A, B, C, wherein 0 is encoded as A, 1 is encoded as B, and 00 is encoded as C whenever 00 occurs (i.e., 00 is not encoded as AA). In accordance with this scheme, the binary word 010011100010 is equivalent to the encoded form ABCBBBCABA. Similarly, digital data formats or alphabets other than binary, such as hexadecimal, decimal, ASCII, etc., can be equally encoded by schemes similar to those exemplified in FIG. 29. Such methods are well known to those skilled in the field of computer science. Schemes more sophisticated than those shown, in terms of optimal information density, such as Lempel-Ziv encoding, can highly efficiently convert and compress data from one alphabet into another.
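The three-state scheme above can be sketched as a left-to-right scan that emits C for each pair 00 it encounters and otherwise emits A for 0 or B for 1. The function names are hypothetical, and the left-to-right scanning order is an assumption, although it reproduces the encoded form given above.

```python
def encode_three_state(bits: str) -> str:
    """Encode bits with 0 -> A, 1 -> B, and 00 -> C, scanning left to right."""
    out = []
    i = 0
    while i < len(bits):
        if bits.startswith("00", i):
            out.append("C")     # a pair of zeros is never written as AA
            i += 2
        elif bits[i] == "0":
            out.append("A")
            i += 1
        elif bits[i] == "1":
            out.append("B")
            i += 1
        else:
            raise ValueError(f"not a binary digit: {bits[i]!r}")
    return "".join(out)


def decode_three_state(symbols: str) -> str:
    """Invert the encoding by expanding each symbol back into bits."""
    table = {"A": "0", "B": "1", "C": "00"}
    return "".join(table[s] for s in symbols)


if __name__ == "__main__":
    encoded = encode_three_state("010011100010")
    print(encoded)
    print(decode_three_state(encoded))
```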

In general, for converting a binary or other digital data payload string or collection of strings into a DNA sequence string or collection of such strings, many of the methods of lossless and lossy encoding or compression, e.g., those well known in computer science, can be used to devise schemes for the primary conversion of input digital data payloads to DNA sequence data payloads, as strings of distinguishable feature DNA segments, generalizing the examples of FIG. 29. In this broader context, the BES schemes exemplified in FIG. 29 illustrate the type of feature elements that could become symbols of an alphabet for data encoding, such as standard bases, modified bases, or sequence motifs or runs, and that such elements must have a cognate reader sensor.

In an illustrative embodiment, a sensor distinguishes between the sensor motifs CCC, AA, and G, represented herein as “a,” “b,” and “c,” respectively, wherein a binary data string is encoded into these symbols in accordance with a lossless or lossy data encoding or compression scheme as string “aabcacb.” In this embodiment, the string aabcacb would be directly translated into DNA sequences CCC, CCC, AA, G, CCC, G, and AA. These segments can then be directly converted into a DNA data payload sequence, in this case CCCCCCAAGCCCGAA (SEQ ID NO: 2).

In certain variations, there may be “punctuation” sequences inserted between distinguishable signal features, which do not alter the distinguishable features but that may provide benefits such as accommodating special properties or constraints of the DNA synthesis chemistry, or providing spacers for added time separation between signal features, or improving the secondary structure of the DNA molecule. For example, if T were such a punctuation sequence, the DNA encoding sequence in the above example would become TCCCTCCCTAATGTCCCTGTAAT (SEQ ID NO: 3), (i.e., a punctuating “T” inserted before and after each of the sequence segments CCC, CCC, AA, G, CCC, G, and AA). In general, such insertion of punctuation sequences or filler sequences may be part of the process of translating from a digital data payload to the DNA encoding sequence to be synthesized.
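The motif translation and the optional punctuation step can be illustrated together. The sketch below is a hypothetical helper, not part of the disclosed system; it assumes the symbol-to-motif table of the illustrative embodiment (a = CCC, b = AA, c = G) and places the punctuation sequence at both ends and between segments, which is what SEQ ID NO: 3 shows.

```python
# Symbol-to-motif table taken from the illustrative embodiment.
MOTIFS = {"a": "CCC", "b": "AA", "c": "G"}


def symbols_to_dna(symbols: str, motifs=MOTIFS, punctuation: str = "") -> str:
    """Translate a symbol string into a DNA payload sequence.

    If `punctuation` is non-empty, it is placed before the first
    segment, between segments, and after the last segment.
    """
    segments = [motifs[s] for s in symbols]
    if not punctuation:
        return "".join(segments)
    return punctuation + punctuation.join(segments) + punctuation


print(symbols_to_dna("aabcacb"))                   # CCCCCCAAGCCCGAA
print(symbols_to_dna("aabcacb", punctuation="T"))  # TCCCTCCCTAATGTCCCTGTAAT
```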

In various aspects of the present disclosure, a DNA data payload of interest is processed by a polymerase sensor multiple times to provide a more robust recovery of digital data from DNA storage. In other aspects, a collection of such payloads on average are processed some expected number of multiple times. These examples benefit from a more accurate estimation of the encoding distinguishable features by aggregating the multiple observations. Multiple processing also has the benefit of overcoming fundamental Poisson sampling statistical variability to ensure that, with high confidence, a data payload of interest is sampled and observed at least once, or at least some desirable minimal number of times.
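The Poisson sampling consideration can be made concrete with a short calculation. The sketch below assumes that the number of times a given payload is observed follows a Poisson distribution with a chosen mean; the function name and the example mean of 10 reads are illustrative assumptions, not values from the disclosure.

```python
import math


def prob_at_least(mean_reads: float, k: int) -> float:
    """P(X >= k) for X ~ Poisson(mean_reads).

    Computed as 1 minus the probability of seeing fewer than k reads.
    """
    if k <= 0:
        return 1.0
    below = sum(
        math.exp(-mean_reads) * mean_reads**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Assumed average of 10 reads per payload (hypothetical value).
print(prob_at_least(10.0, 1))  # chance the payload is seen at least once
print(prob_at_least(10.0, 3))  # chance it is seen at least three times
```

Raising the average number of reads per payload drives these probabilities toward 1, which is the sense in which multiple processing overcomes sampling variability.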

In various embodiments, the number of such repeat interrogations is in the range of 1 to about 1000 times, or in the range of about 10 to 100 times. Such multiple observations may comprise: (i) observations of the same physical DNA molecule by the polymerase sensor, and/or (ii) one or more polymerase sensors processing multiple, physically distinct DNA molecules that carry the same data payload. In the latter case, such multiple, physically distinct DNA molecules with the same data payload may be the DNA molecules produced by the same bulk synthesis reaction, the molecules obtained from distinct synthesis reactions targeting the same data payload, or replicate molecules produced by applying amplification or replication methods such as PCR, T7 amplification, rolling circle amplification, or other forms of replication known to those skilled in molecular biology. The aggregation of such multiple observations may be done through many methods, such as averaging or voting, maximum likelihood estimation, Bayesian estimation, hidden Markov methods, graph theoretic or optimization methods, or deep learning neural network methods.
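Of the aggregation methods listed, voting is the simplest to illustrate. The following sketch takes a per-position majority vote over repeated reads of the same payload. It assumes, for simplicity, that the reads are already aligned and of equal length, which real sensor output would not guarantee; the function name and the example reads are hypothetical.

```python
from collections import Counter


def consensus_by_voting(reads: list[str]) -> str:
    """Per-position majority vote over equal-length, pre-aligned reads."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("this sketch assumes reads of equal length")
    # zip(*reads) yields one column of observed symbols per position.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


# Five hypothetical reads of one payload, two of them with a single error.
reads = ["CCCAAG", "CCCAAG", "CCTAAG", "CCCAAG", "ACCAAG"]
print(consensus_by_voting(reads))  # CCCAAG
```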

In various aspects of the present disclosure, molecular biology methods enable the polymerase sensor to interrogate the same DNA molecule data payload multiple times. Three such embodiments of template architectures are shown in FIG. 27B. One such method is to circularize the molecule (upper figure), such as via ligation of the ends, and use a strand displacing polymerase that can repeat the process around the circle multiple times, thereby generating multiple reads of the same molecule. Another method is to have a hairpin duplex (middle figure), and use a strand displacing polymerase that processes the lower strand, wraps around the hairpin feature, and processes the upper strand of the molecule that comprises a complementary sequence, which, in some embodiments, may also provide distinguishable signals related to the lower strand distinguishable signals. One other embodiment, illustrated in the lower figure of FIG. 27B, is to construct the template molecule with a tandem repeat of the data payload, repeated one or more times, so that the processive enzyme will process through multiple instances of the same data payload.

In various embodiments of the present disclosure, digital data stored in DNA is read at a high rate, such as approaching 1 Gigabyte per second for the recovery of digital data, as is possible with large scale magnetic tape storage systems. Because the maximum processing speed of a polymerase enzyme is in the range of 100-1000 bases per second, depending on the type, the bit recovery rate of a polymerase-based sensor is limited to a comparable speed. Thus, in various embodiments, millions of sensors are deployed in a cost effective format to achieve the desired data reading capacity.
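The need for millions of sensors follows from simple arithmetic, sketched below. The assumption of 1 recovered bit per base is a placeholder chosen for illustration; the actual figure depends on the encoding scheme and on how many repeat reads are needed, and the function name is hypothetical.

```python
import math


def sensors_needed(target_bytes_per_s: float,
                   bases_per_s: float,
                   bits_per_base: float = 1.0) -> int:
    """Number of sensors needed to reach a target decoded data rate.

    `bits_per_base` is an assumed encoding density, not a disclosed value.
    """
    target_bits_per_s = target_bytes_per_s * 8
    return math.ceil(target_bits_per_s / (bases_per_s * bits_per_base))


# Target of 1 Gigabyte per second at the two ends of the stated enzyme range.
print(sensors_needed(1e9, 1000))  # 8000000
print(sensors_needed(1e9, 100))   # 80000000
```

Under these assumptions the requirement falls between roughly 8 million and 80 million sensors, before allowing for repeat reads or inactive sensors.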

In various embodiments, many individual molecular sensors are deployed herein in a large scale sensor array on a CMOS chip, which is the most cost-effective semiconductor chip manufacturing process. FIG. 30A illustrates an embodiment of a fabrication stack usable to create a massively parallel array of molecular sensors on a chip. In this example, the sensor measurement circuitry is deployed as a scalable pixel array on a CMOS chip, a nano-scale lithography process is used to fabricate the nano-electrodes, and molecular self-assembly chemical reactions, in solution, are used to establish the molecular complex on each nano-electrode in the sensor array. The result of this fabrication stack is the finished DNA reader sensor array chip indicated at the bottom of FIG. 30A. In various embodiments, the nanoscale lithography is done using a high resolution CMOS node, such as a 28 nm, 22 nm, 20 nm, 16 nm, 14 nm, 10 nm, 7 nm or 5 nm node, to leverage the economics of CMOS chip manufacturing. In contrast, the pixel electronics may be done at a coarser node better suited to mixed signal devices, such as 180 nm, 130 nm, 90 nm, 65 nm, 40 nm, 32 nm or 28 nm. Alternatively, the nano-electrodes may be fabricated by any one of a variety of other fabrication methods known to those skilled in the art of nanofabrication, such as e-beam lithography, nano-imprint lithography, ion beam lithography, or advanced methods of photolithography, such as any combinations of Extreme UV or Deep UV lithography, multiple patterning, or phase shifting masks.

FIG. 30B illustrates an embodiment of a high-level CMOS chip pixel array architecture for a DNA sequencing chip (at the left side of the figure), comprising a scalable array of sensor pixels, with associated power and control circuitry and major blocks such as Bias, Analog-to-Digital converters, and timing. The inset in the figure shows an individual sensor pixel as a small bridged structure and where this individual electronic sensor is located in the array of sensor pixels. FIG. 30B also illustrates (at the right side of the figure) the details of an embodiment of a molecular electronics sensor circuit pixel in the array. As illustrated in FIG. 30B, the sensor pixel comprises an amplifier, a reset switch, and circuitry for supplying the source, gate and drain voltages, and readout of results. In various embodiments, the single pixel circuitry comprises a trans-impedance current amplifier, a voltage-biasable source, and reset switches. FIG. 31 shows an embodiment of a circuit schematic of the pixel amplifier in detail at the left side of the figure, along with simulation results at the right side of the figure showing the voltage signal vs time when used to measure a 10 pA current, and with a reset applied periodically as indicated in the plot. This embodiment exemplifies one non-limiting selection of circuit components and parameters (transistors, resistors, capacitors, etc.).

FIG. 32 illustrates an embodiment of an annotated chip design layout file and the corresponding finished chip for comparison. At the left of FIG. 32 is an annotated image of the rendered layout (GDS) file for the chip design comprising the CMOS pixel array of FIG. 30B with 256 pixels, and annotated to show the location of the Bias, Array and Decoder regions of the chip. At the right of FIG. 32 is an optical microscope image of the corresponding finished chip based on the final design, produced at TSMC, Inc. semiconductor foundry (San Jose, Calif.) with the TSMC 180 nm CMOS process, with no passivation layer.

FIG. 33 shows SEM images of the finished CMOS chip of FIG. 32, with high-resolution images further showing an 80 μm pixel for a reader system comprising nanoelectrodes and a polymerase molecular complex in place between the electrodes. At the left of FIG. 33 is a SEM image of the CMOS chip, with no top passivation layer and exposed planarized metal 6 layer. The middle, higher resolution image clearly shows the sub-optical surface features of an 80 μm pixel, and notably the exposed vias (source, gate, and drain) where the nanoelectrodes are to be deposited by post-CMOS nanofabrication processing steps and electrically connected into the amplifier circuit, as shown in the right portion of FIG. 30B. The furthest right SEM image in FIG. 33 shows an e-beam lithography fabricated pair of spaced apart nanoelectrodes with a molecular complex in place. The sketch at the bottom right of FIG. 33 is an illustration of the molecular electronics sensor comprising a polymerase molecular complex, which is labeled by the gold dot.

In various embodiments of a DNA reader device, use of a CMOS chip device in conjunction with nano-scale manufacturing technologies ultimately yields a much lower cost, high throughput, fast, and scalable system. For example, sensors such as this can process DNA templates at the rate of 10 or more bases per second, 100 or more times faster than current optical sequencers. The use of CMOS chip technology ensures scalability and low system cost in a mass-producible format that leverages the enormous infrastructure of the semiconductor industry. As noted, whatever error modes or accuracy limitations may exist in a DNA sensor, or that may arise at faster reading speed (e.g., by modifying the enzyme or buffer or temperature or environmental factors, or by sampling data at lower time resolution), can be compensated for in the overall encoder/decoder-reader-writer framework described.

In various embodiments of the present disclosure, a DNA reader chip for use herein comprises at least 1 million sensors, at least 10 million sensors, at least 100 million sensors, or at least 1 billion sensors. At a typical sensor data sampling rate of 10 kHz, and recording 1 byte per measurement, a 100 million sensor chip produces raw signal data at a rate of 1 Terabyte (TB) per second. In considering how many sensors are desirable on a single chip, one critical consideration is the rate at which such a chip can decode digital data stored in DNA compared to the desirable digital data reading rates. It is, for example, desirable to have digital data read out at a rate of up to about 1 Gigabyte per second. Note that each bit of digital data encoded as DNA will require multiple signal measurements to recover, given that a feature of the signal is used to store this information, so the raw signal data production rate for the measured signal will be much higher than the recovery rate of encoded digital data. For example, if 10 signal measurements are required to recover 1 bit of stored digital data, as might be the case for signal features such as in FIG. 11, and each measurement is an 8-bit byte, that is a factor of 80 bits of signal data to recover 1 bit of stored digital data. Thus, digital data reading rates are anticipated to be on the order of 100 times slower than the sensor raw signal data acquisition rate. For this reason, achieving a desirable digital data reading rate of 1 Gigabyte/sec would require nearly 0.1 TB/sec of usable raw signal data. Further, given that not all the sensors in a single chip may be producing usable data, chips that produce up to 1 TB/sec of raw data are desirable, based on the desired ultimate digital data recovery rates from data stored as DNA. In various embodiments, such recovery rates correspond to a 100 million sensor pixel chip.
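The data-rate figures in this paragraph can be reproduced with integer arithmetic. The sketch below uses only the numbers stated in the text (10 kHz sampling, 1 byte per measurement, 10 measurements per recovered bit); the function names are hypothetical, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
def raw_rate_bytes_per_s(n_sensors: int, sample_hz: int,
                         bytes_per_sample: int = 1) -> int:
    """Raw signal data produced by the whole chip, in bytes per second."""
    return n_sensors * sample_hz * bytes_per_sample


def raw_bits_per_payload_bit(measurements_per_bit: int,
                             bits_per_measurement: int = 8) -> int:
    """Raw signal bits consumed to recover one bit of stored data."""
    return measurements_per_bit * bits_per_measurement


def required_raw_rate(payload_bytes_per_s: int,
                      measurements_per_bit: int) -> int:
    """Usable raw bytes per second needed for a target decoded rate."""
    return payload_bytes_per_s * raw_bits_per_payload_bit(measurements_per_bit)


print(raw_rate_bytes_per_s(100_000_000, 10_000))  # 1000000000000  (1 TB/s)
print(raw_bits_per_payload_bit(10))               # 80
print(required_raw_rate(10**9, 10))               # 80000000000   (0.08 TB/s)
```

The last figure, 0.08 TB/sec, is the "nearly 0.1 TB/sec" of usable raw signal cited for a 1 Gigabyte/sec reading rate.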

In various embodiments of the present disclosure, multiple chips are deployed within a reader system to achieve desired system-level digital data reading rates. The DNA data reader chip of FIG. 30A is, in various embodiments, deployed as part of a complete system for reading digital data stored in DNA. The features of an embodiment of a complete system are illustrated in FIG. 34. In various aspects, and with reference to FIG. 34, a complete digital data reading system comprises a motherboard with a staging area for an array of multiple chips, in order to provide data reading throughput beyond that of the limitations of a single chip. Such chips are individually housed in flow cells, with a fluidics liquid handling system that controls the addition and removal of the sensor system liquid reagents. In addition, the fluidics system receives DNA encoding data in solution form, originating from a data repository source. In various aspects, the motherboard further comprises a suitable first stage data processing unit capable of receiving and reducing raw signal data at very high rates, e.g., exceeding 1 TB/sec, exceeding 10 TB/sec, or exceeding 100 TB/sec, indicated as a primary signal processor. This primary processor may comprise one, multiple, or combinations of an FPGA, GPU, or DSP device, or a custom signal processing chip, and this may optionally be followed by stages of similar such signal processors for a processing pipeline. Data output of this primary pipeline is typically transferred to a fast data storage buffer, such as a solid state drive, with data from here undergoing further processing or decoding in a CPU-based sub-system, from which data is buffered into a lower speed mass storage buffer, such as a hard drive or solid state drive or array of such drives. From there it is transferred to an auxiliary data transfer computer sub-system that handles the subsequent transfer of decoded data to a destination. All these system operations are under the high-level control of an auxiliary control computer that monitors, coordinates and controls the interplay of these functional units and processes.

In various embodiments, chips within the reader system may be disposable and replaced after a certain duty cycle, such as 24 hours to 48 hours. In other embodiments, the chips may be reconditioned in place after such a usage period, whereby the molecular complex, and possibly conjugating groups, are removed, and then replaced with new such components through a series of chemical solution exposures. The removal process may comprise using voltages applied to the electrodes to drive removal, such as an elevated voltage applied to the electrodes, an alternating voltage applied to the electrodes, or a voltage sweep. The process may also comprise the use of chemicals that denature, dissolve or dissociate or otherwise eliminate such groups, such as high molarity urea, or guanidine or other chaotropic agents, proteases such as Proteinase K, acids such as HCl, bases such as KOH or NaOH, or other agents well known in molecular biology and biochemistry for such purposes. This process may also include the use of applied temperature or light to drive the removal, such as elevated temperature or light in conjunction with photo-cleavable groups in the molecular complex or conjugation groups.

FIG. 35 illustrates an embodiment of a cloud based DNA data archival storage system, in which the complete reader system, such as exemplified in FIG. 34, is, in certain embodiments, deployed in aggregate format to provide the cloud DNA reader server of the overall archival storage and retrieval system. The system of FIG. 35 comprises a cloud computer system, with a standard storage format (depicted at the upper left of the figure). Such a standard cloud computer system comprises a DNA archival data storage capability as indicated. In various aspects, a cloud-based DNA synthesis system can accept binary data from the cloud computer and produce the physical data encoding DNA molecules. This server stores the output molecules in a DNA data storage archive (depicted at the lower right of the figure) wherein the physical DNA molecules that encode data are stored in a dried or lyophilized form, or in solution, at ambient temperature, cooled temperature, or frozen. When data is to be retrieved, a sample of the DNA from the archive is provided to the DNA data reader server, which outputs decoded binary data back to the primary cloud computer system. This DNA data reader server is, in certain embodiments, powered by a multiplicity of DNA reader chip-based systems, such as indicated in FIG. 34, in combination with additional computers that perform the final decoding of the DNA derived data back to the original data format of the primary cloud storage system.

In various embodiments, a molecular electronics sensor comprises the configuration illustrated in FIG. 36. In this case, fundamental electronic measurements are made by a nanopore ionic current sensor that consists of electrodes on either side of a membrane, a pore localized in the membrane, and an aqueous solution phase residing on both sides of the pore. In this embodiment, the pore regulates the passage of ionic current (indicated by the dashed arrow and the “i”). The pore may comprise a biological protein nanopore, native or mutated, and the membrane may comprise a lipid membrane, or synthetic analogue thereof. The pore may also comprise a solid state pore, with the membrane comprising a thinned membrane composed of a solid material such as SiN, or Teflon. The pore may have electrodes of the same polarity, or, as illustrated, opposite polarity. As shown in FIG. 36, the polymerase molecule is further complexed with the pore, as part of a molecular complex involving a small number of molecules embedded through the membrane as part of the pore and to provide a conjugation to the polymerase. As the polymerase processes a DNA template, the ionic current through the pore is modulated by this activity, producing distinguishable signal features that correspond to distinct sequence features. Aside from a different geometry of the nano-electrical measurement, the considerations are otherwise identical to those already reviewed herein. That is, nanopore current sensor versions of the polymerase-based DNA digital data reader are of similar use herein.

In various embodiments, a molecular electronics sensor comprises the configuration shown in FIG. 37, wherein the polymerase is directly and specifically conjugated to the pore, and wherein modified dNTPs are used to produce distinguishable signals from DNA sequence features, which is a situation comparable to that of FIG. 23 for the nano-electrode sensor. For producing signals in a nanopore sensor, such dNTP modifications may comprise groups on the γ-phosphate of the dNTP, which can occlude the pore while the dNTP is undergoing incorporation by the polymerase, thereby resulting in current suppression features. In various embodiments, such modifications comprise extending the tri-phosphate chain to 4, 5, 6 or up to 12 phosphates, and adding terminal phosphate groups, or groups to any of the phosphates at position 2 or more, which are removed by polymerase incorporation, such groups including polymers that may occlude the pore by entering the pore, such as PEG polymers or DNA polymers. The polymerase conjugation to the pore may comprise any one of many possible conjugation chemistries, such as a molecular tether, or Spy-SpyCatcher protein-based conjugation system, or the like. For the nanopore sensor embodiments indicated in FIG. 36 and FIG. 37, all the aspects of the disclosure put forth above in the context of FIG. 10B also apply in this instance, to provide a nanopore ion current sensor-based sensor for reading digital data stored in DNA molecules, and the related beneficial aspects, encoding schemes, chip formats, systems and cloud based DNA digital data storage systems.

One embodiment of the molecular electronic sensor of FIG. 10B, illustrated conceptually in FIG. 15, comprises a carbon nanotube as the bridge molecule (represented by the bold horizontal bar in FIG. 15 bridging the gap between positive and negative electrodes). This embodiment is illustrated in FIG. 38. In various aspects, the carbon nanotube bridge comprises a single or multi-walled carbon nanotube, and is conjugated to the polymerase molecule at a specific site using any of many possible conjugation chemistries. Such a conjugation may, for example, comprise a pyrene linker to attach to the nanotube via π-stacking of the pyrene on the nanotube, or may comprise attachment to a defect site residing in the carbon nanotube. In this case, the current passing through a carbon nanotube molecular wire is known to be highly sensitive to other molecules in the surrounding environment, such as indicated in FIG. 10A. It is further known that current passing through a carbon nanotube is sensitive to the activity of an enzyme molecule properly conjugated to that nanotube, including polymerase enzymes. For this particular embodiment, all the aspects of the present disclosure put forth above apply in this instance, to provide a carbon nanotube based sensor for reading digital data stored in DNA molecules, including the related beneficial aspects, encoding schemes, chip formats, systems and cloud based DNA digital data storage systems.

An alternative sensor that produces optical signals is a Zero Mode Waveguide sensor, such as the sensor illustrated in FIG. 39. Such a sensor may comprise a single polymerase as shown, conjugated to the bottom of the metallic well, in the evanescent zone of the excitation field applied to the thin substrate, in a Total Internal Reflection mode. The polymerase is provided with primed template and dNTPs with dye labels on the cleavable phosphate group. When such a dNTP is incorporated, the dye label is held in the evanescent field, and is stimulated to emit photons of the corresponding dye energy spectrum or color. The result is that, under appropriate conditions, such a sensor may produce distinguishable optical signals as indicated, which can be used to encode digital information into DNA molecules. The distinguishable signals here may be photon emissions of a different energy distribution, or color, or emissions with different distinguishable spectra, or different duration or intensity or shape of the spectra versus time, or any combination of such elements that results in distinguishable features. For this Zero Mode Waveguide sensor embodiment indicated in FIG. 39, all the aspects of the disclosure put forth above in the context of FIG. 10B also apply in this instance, to provide a Zero Mode Waveguide-based sensor for reading digital data stored in DNA molecules, and the related beneficial aspects, encoding schemes, chip formats (in this case, optical sensor chips, such as image sensor chips), systems and cloud based DNA digital data storage systems may apply to such a sensor.

Optimization of the DNA Writing Device

In various embodiments, the DNA information storage system of the present disclosure further comprises a DNA writing device capable of writing a large number of DNA sequences in parallel as synthesized molecules, with each desired sequence embodied in multiple synthesized molecules, and the rate of synthesis, or time per base, as fast as possible such that the overall rate of writing DNA information is fast enough, and at high enough volume, for practical use in large scale archival information storage.

Current commercial DNA synthesis based on the classical phosphoramidite chemistry cycle is relatively slow, requiring on the order of 30 minutes per base addition. The bulk of the 30 minute base addition cycle is the acid-mediated deprotection of the 5′-OH on the distal end of the extending oligonucleotide chain. The prolonged exposure of the incipient sequences to these acid conditions also creates a major source of sequence error via de-purination. This method also suffers from relatively low parallelism, being performed in 384-well plates on an expensive instrument. The process is also limited to making sequences of at most several hundred bases in length due to efficiency yield limitations in a stepwise synthesis. Therefore, this method is best suited to make larger quantities of each of a small number (1 to thousands) of relatively short (<˜200 base) DNA sequences.

Higher throughput commercial DNA synthesis systems have been developed to support the in-situ synthesis of DNA microarrays. Such systems in effect print a large array of micro-spots of in-situ synthesized DNA, adding one base at a time in a highly parallel way across the spots. For example, the Agilent ink-jet-based DNA oligo array printer can print an array of up to 1 million DNA spots on a glass microscope slide, where each spot is on the order of 20 microns in diameter, with a 30-micron spot-to-spot pitch for the rectangular array of spots. The synthesis reaction is still relatively slow, on the order of 1 base per hour, and the DNA length is even more limited than classical well-plate synthesis, to practical lengths of up to ˜100 bases. Nonetheless, systems such as this can synthesize a total of ˜100 million letters of DNA sequence (˜25 MB), in several days, at a cost of several hundred dollars for the finished array. However, the writing instrument has a high capital cost and complexity such that it has never been commercialized, and production of such DNA arrays is done in a centralized factory format with limited capacity. Furthermore, future upscaling of this technology in terms of spots/array may be at the asymptotic limit already, given that it leverages existing ink-jet technology which may itself have reached its asymptotic limit.
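The ˜25 MB figure for such an array can be checked with a short calculation. The sketch below assumes 2 bits per base (four standard bases with no coding overhead), which is an illustrative assumption consistent with the stated total rather than a disclosed encoding; the function name is hypothetical.

```python
def array_capacity_bytes(spots: int, bases_per_spot: int,
                         bits_per_base: int = 2) -> int:
    """Upper-bound data capacity of one synthesized array, in bytes.

    Assumes one distinct sequence per spot and no redundancy or
    error-correction overhead.
    """
    total_bases = spots * bases_per_spot
    return total_bases * bits_per_base // 8


# 1 million spots of ~100 bases each, as in the array printer example.
print(1_000_000 * 100)                          # 100000000 letters of DNA
print(array_capacity_bytes(1_000_000, 100))     # 25000000 bytes (~25 MB)
```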

Thus, for large scale archival DNA storage, substantially lower costs, faster speed, and lower capital cost of the DNA writing device are highly desirable. In particular, the storage writing cost is still near $10 per MB, which is far above the estimated $0.02 per GB (500,000 fold more) for magnetic tape storage writing. Thus the costs of writing DNA need to come down dramatically to make common long term storage applications practical, preferably by several orders of magnitude, and preferably by 1,000,000 fold.
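The 500,000 fold cost gap follows directly from the two quoted prices. The sketch below assumes decimal units (1 GB = 1000 MB); the function name is hypothetical and the prices are those stated in the text.

```python
def cost_ratio(dna_usd_per_mb: float, tape_usd_per_gb: float,
               mb_per_gb: int = 1000) -> float:
    """How many times more expensive DNA writing is than tape writing."""
    dna_usd_per_gb = dna_usd_per_mb * mb_per_gb
    return dna_usd_per_gb / tape_usd_per_gb


ratio = cost_ratio(dna_usd_per_mb=10.0, tape_usd_per_gb=0.02)
print(round(ratio))  # 500000
```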

To achieve this goal, and in certain embodiments of the present disclosure, a DNA writing system for use herein comprises a CMOS-chip based array of actuators for DNA synthesis. The DNA writer consists of a CMOS chip, with an array of actuator pixels that direct actuator specific voltage/current or light mediated 5′-OH deprotection, whereby novel voltage or light sensitive protecting groups enable faster deprotection kinetics and no de-purination errors. In various aspects, the chip includes millions up to billions of actuator pixels. Such pixels may comprise a nanoelectrode and/or selectable light source, around which DNA synthesis takes place. Voltage applied to this electrode, or current sourced to it, or localized light actuation would control each 5′-OH deprotection reaction, as a series of A, C, G, T, . . . cyclical addition reactions take place globally across the chip, and wherein each pixel controls the addition or not of the supplied base during each cycle via voltage or current or light.
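The pixel-controlled cyclical addition described above can be modeled in software as follows. This is a toy simulation under stated assumptions: bases are supplied globally in a fixed A, C, G, T order, every enabled addition succeeds without error, and a pixel enables addition only when the supplied base is the next base of its target. The function name and target sequences are hypothetical.

```python
def synthesize(targets: list[str], cycle_order: str = "ACGT",
               max_cycles: int = 10_000) -> tuple[list[str], int]:
    """Simulate global base cycles with per-pixel addition control.

    Returns the grown sequence at each pixel and the number of base
    supply cycles used.
    """
    grown = ["" for _ in targets]
    cycles = 0
    while any(len(g) < len(t) for g, t in zip(grown, targets)):
        if cycles >= max_cycles:
            raise RuntimeError("targets not completed within max_cycles")
        base = cycle_order[cycles % len(cycle_order)]  # base supplied chip-wide
        for i, target in enumerate(targets):
            pos = len(grown[i])
            # The pixel actuates (deprotects) only if this base is wanted next.
            if pos < len(target) and target[pos] == base:
                grown[i] += base
        cycles += 1
    return grown, cycles


grown, cycles = synthesize(["ACGT", "TTA", "GCA"])
print(grown)   # ['ACGT', 'TTA', 'GCA']
print(cycles)  # 9
```

In this model the number of cycles depends on the sequences themselves, which is one reason the choice of encoding can affect writing time.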

The use of CMOS chip scaling supports the ability to ultimately provide billions of such synthesis sites on a standard, low cost, mass-produced chip. Localized voltage/light actuation can also be used to accelerate the synthesis chemistry and shorten the cycle time, such as from ˜30 minutes down to seconds. The actuator electrodes may be derivatized with chemical layers that transduce voltage or current to other useful electrochemical local environment changes, such as, for example, to provide for voltage-generated acids as the means to modulate non-classical phosphoramidite synthesis chemistry. In other aspects, a voltage/current may modulate a conformational or steric or mechanical change of local polymer/molecular matrix structure in which the synthesis takes place, that physically impedes or allows the base addition. In particular, one embodiment of such a system could have micro- or nano-wells or containers at each site, which contain the growing DNA oligos and which can be actuated to open/close to physically selectively control the base addition reactions. The added bases may also contain charge or other modifications that facilitate the use of voltage or light to direct and control the process. Through CMOS chip scaling and voltage or current or light-directed synthesis augmentation of a phosphoramidite synthesis cycle, there is the potential for large increases in scale, and reductions in cost of the process and instrument used to perform synthesis. The finished DNA fragments, consisting of multiple exemplars of each target sequence for a given pixel at each site, can be released from the support post-synthesis, and pooled in solution to form the physical archive.

In the context of this and other DNA writer embodiments, selection of the encoding/decoding algorithm may minimize system costs, especially time costs. For example, dephosphorylating is a slow process, and some of the bases and sequences (e.g., purine versus amine, homogeneous runs of G and C) are more difficult to synthesize due to chemical or secondary structure effects. This presents the option to not drive the dephosphorylating to completion, to save time, at the cost of more error, and also to avoid the use of certain base compositions in the encoded sequence (e.g., do not use purines, or do not use runs of G, etc., in the encoding) to allow faster chemical processing without major added error burden. In this way the synthesis reaction can be accelerated, and the encoding/decoding algorithm can compensate in terms of the error corrective or error avoiding encoding. Aside from avoiding high error sequence modes, standard error correcting code algorithms can correct extremely high rates of error in the DNA sequence, even extreme error rates of up to 50% or more. In various aspects, the encoding/decoding is co-optimized with the properties of the DNA reader and DNA writer, so as to optimize overall system performance, and/or to reduce overall system cost by some cost measure of interest such as time or financial cost.
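An error-avoiding encoding of the kind described needs a way to test candidate sequences against composition rules. The sketch below checks one such rule, a cap on homogeneous runs of G or C. The run limit of 3 and the function names are illustrative assumptions; the disclosure does not specify a threshold.

```python
from itertools import groupby


def longest_run(seq: str, bases: str = "GC") -> int:
    """Length of the longest homogeneous run of any base in `bases`."""
    return max(
        (len(list(group)) for base, group in groupby(seq) if base in bases),
        default=0,
    )


def is_allowed(seq: str, max_run: int = 3, bases: str = "GC") -> bool:
    """True if no run of a restricted base exceeds `max_run` (assumed limit)."""
    return longest_run(seq, bases) <= max_run


print(longest_run("ATGGGGCA"))  # 4
print(is_allowed("ATGGGGCA"))   # False
print(is_allowed("ATGGCA"))     # True
```

An encoder could use such a check to reject or re-map payload words whose DNA form would be slow or error-prone to synthesize.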

Optimization of the DNA Storage Archive Operations

In certain aspects, the DNA information storage system further comprises a DNA storage archive. In various embodiments of the DNA information storage system, novel ways are provided to achieve desirable operations related to managing storage archives. In various embodiments of the present system, the following types of operations may be performed:

Create a copy of the archive;

Append data to the archive;

Readout a targeted volume from the archive;

Delete a volume from the archive; or

Search the archive.

In various embodiments, a DNA archive comprises a pool of DNA molecules, with each desired DNA sequence represented by a number of molecular exemplars. This pool of DNA molecules may be stored in a dry state, or in solution phase such as maintained at low temperature or frozen in storage. In certain examples, the archive can temporarily be brought up to working temperatures in a compatible buffer solution to perform these operations. These operations would be performed efficiently by the physical storage system, which may include freezers, refrigerators, and automation for handling of tubes, liquid handling, performing reactions, and the other procedures related to maintaining and manipulating the physical archive material.

In various embodiments, storage-related operations in a DNA storage archive can be achieved as follows:

Copying: An archive may be copied by taking an aliquot of a stock solution or, for copying without depletion, by including amplification primer sites in the molecular encoding and then priming and amplifying the archive, in linear or exponential amplification reactions, prior to taking an aliquot as a copy.

Appending: Appending data to the archive or merging archives can be achieved by pooling in and mixing with the additional DNA or archive material.

Targeted Reading/Deleting: Working with individual “volumes” within an archive can be performed by encoding into the DNA molecules sequence-specific oligo binding sites, with a different identifier/binding sequence for each volume to be made so accessible. Then, to read out a specific volume, hybridization-based capture could be used to select out just the DNA fragments with the desired binding sequences. Deleting of a volume could be performed by a subtractive-hybridization process to remove all DNA fragments with a given binding sequence. In another embodiment, the deletion could be performed by using oligo-directed DNA cleavage/degradation, in particular enzymatic methods that use Cleavase, Dicer or CRISPR for oligo-binding directed destruction of the targeted molecules. In yet another embodiment, primer binding followed by a synthesis or ligation reaction may be used to incorporate bases or oligos that allow selective destruction or removal, such as through biotinylated elements removed by a streptavidin column. Volume identifiers could also be added by synthesizing DNA with nucleotide modifications, so that the relevant binding targets are not in DNA-sequence specific hybridization per se, but in other modifications on the bases used in the synthesis. For example, use of biotinylated bases, or bases with various hapten modifications, etc., similarly provides a selective ability to bind or manipulate subsets of the DNA via the corresponding interaction partners for these modifications intentionally introduced in the synthesis.

Searching: Search of an archive for a literal input string can be achieved by encoding the search string or strings of interest into DNA form, synthesizing a complementary form or related primers for the desired DNA sequences, and using hybridization or PCR amplification to assay the archive for the presence of these desired sequence fragments, according to such standard assays as are used by those skilled in the art of molecular biology to ascertain the presence of a sequence segment in a complex pool of DNA fragments. The search could report either presence or absence, or could recover the associated fragments containing the search string for complete reading.
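As a non-limiting illustration, the storage-related operations above can be modeled in software as operations on a pool that maps each DNA sequence to its number of molecular exemplars. In the following Python sketch, the class `Archive`, its method names, and the two-bits-per-base mapping in `text_to_dna` are assumptions made for this illustration only. Substring matching on a single strand stands in for hybridization to a complementary probe, amplification is idealized as exact doubling per cycle, and no chemistry is modeled.

```python
from collections import Counter

BASES = "ACGT"  # hypothetical mapping: 00->A, 01->C, 10->G, 11->T


def text_to_dna(text: str) -> str:
    """Encode UTF-8 bytes as DNA, four bases per byte (illustrative only)."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in text.encode("utf-8")
        for shift in (6, 4, 2, 0)
    )


class Archive:
    """Toy pool mapping each DNA sequence to its count of exemplars."""

    def __init__(self, pool=None):
        self.pool = Counter(pool or {})

    def amplify(self, cycles: int) -> None:
        """Idealized exponential amplification: counts double every cycle."""
        for seq in self.pool:
            self.pool[seq] *= 2 ** cycles

    def aliquot(self, fraction: float) -> "Archive":
        """Remove a fraction of each sequence's exemplars and return them."""
        # Unary plus drops sequences whose rounded-down share is zero.
        taken = +Counter({s: int(n * fraction) for s, n in self.pool.items()})
        self.pool -= taken
        return Archive(taken)

    def append(self, other: "Archive") -> None:
        """Pool in another archive; counts of shared sequences add up."""
        self.pool.update(other.pool)

    def select(self, site: str) -> "Archive":
        """Stand-in for hybridization capture: fragments containing `site`."""
        return Archive({s: n for s, n in self.pool.items() if site in s})

    def delete(self, site: str) -> None:
        """Stand-in for subtractive hybridization: drop matching fragments."""
        self.pool = Counter(
            {s: n for s, n in self.pool.items() if site not in s}
        )
```

In this model, a copy without depletion corresponds to `amplify(1)` followed by `aliquot(0.5)`, which leaves the stock at its original counts. A targeted read corresponds to `select` with a volume identifier, and a search corresponds to `select(text_to_dna(query))`, where an empty result reports absence and a non-empty result returns the fragments for complete reading.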

1. An information storage system comprising: a writing device that synthesizes a nucleotide sequence that encodes a set of information; and a reading device that interprets the nucleotide sequence by decoding the interpreted nucleotide sequence into the set of information, wherein the reading device comprises a molecular electronics sensor, the sensor comprising a pair of spaced apart electrodes and a polymerase enzyme molecule coupled to the electrodes in a molecular electronics circuit, and wherein the molecular electronics sensor produces distinguishable signals in a measurable electrical parameter of the molecular electronics sensor, when interpreting the nucleotide sequence.
2. The information storage system of claim 1, wherein the polymerase enzyme molecule is coupled to each electrode by a single bridge molecule attached to and connecting the electrodes, wherein the polymerase enzyme is conjugated to the bridge molecule.
3. The information storage system of claim 1, wherein the polymerase enzyme molecule is coupled to each electrode by two arm molecules, one attached to each electrode, wherein the arm molecules are conjugated to two distinct sites on the polymerase enzyme.
4. The information storage system of claim 1, wherein the polymerase enzyme molecule is directly conjugated to the electrodes.
5. The information storage system of claim 1, wherein the set of information is binary.
6. The information storage system of claim 1, wherein the nucleotide sequence comprises a DNA sequence.
7. The information storage system of claim 6, further comprising at least one of error detecting schemes or error correction schemes for minimizing errors within the DNA sequence.
8. The information storage system of claim 1, wherein the writing device comprises a CMOS chip based array of actuator pixels for DNA synthesis, the actuator pixels directing voltage/current or light-mediated deprotection with a DNA synthesis reaction comprising phosphoramidite or ligation chemistries.
9. The information storage system of claim 1, wherein the measurable electrical parameter of the sensor is modulated by enzymatic activity of the polymerase enzyme molecule.
10. The information storage system of claim 1, wherein the molecular electronics sensor is part of a CMOS sensor array chip further comprising a plurality of molecular electronics sensors and supporting pixel circuitry that performs measurements of the measurable electrical parameter.
11. The information storage system of claim 1, wherein the molecular electronics sensor further comprises a gate electrode adjacent the spaced apart electrodes.
12. A method of interpreting a set of information encoded in a nucleotide sequence of a DNA molecule, the method comprising: supplying the DNA molecule to a molecular electronics sensor capable of producing distinguishable signals in a measurable electrical parameter of the molecular electronics sensor relating to the set of information; generating the distinguishable signals; and converting the distinguishable signals into the set of information, wherein the molecular electronics sensor comprises a pair of spaced apart electrodes and a polymerase enzyme molecule coupled to the electrodes in a molecular electronics circuit, and wherein the measurable electrical parameter of the molecular electronics sensor is modulated by enzymatic activity of the polymerase enzyme molecule.
13. The method of claim 12, wherein the set of information is binary.
14. The method of claim 12, wherein the polymerase enzyme molecule is coupled to each electrode by a single bridge molecule attached to and connecting the electrodes, wherein the polymerase enzyme is conjugated to the bridge molecule.
15. The method of claim 12, wherein the polymerase enzyme molecule is coupled to each electrode by two arm molecules, one attached to each electrode, wherein the arm molecules are conjugated to two distinct sites on the polymerase enzyme.
16. The method of claim 12, wherein the polymerase enzyme molecule is directly conjugated to the electrodes.
17. A method of archiving and retrieving a set of information, the method comprising: converting the set of information through an encoding scheme into a nucleotide sequence that encodes the set of information; synthesizing a DNA molecule comprising the nucleotide sequence; exposing the DNA molecule to a molecular electronics sensor capable of producing distinguishable signals in a measurable electrical parameter of the molecular electronics sensor relating to the set of information; and converting the distinguishable signals into the set of information, wherein the molecular electronics sensor comprises a pair of spaced apart electrodes and a polymerase enzyme molecule coupled to the electrodes in a molecular electronics circuit, and wherein the measurable electrical parameter of the molecular electronics sensor is modulated by enzymatic activity of the polymerase enzyme molecule as the polymerase enzyme molecule processes the DNA molecule.
18. The method of claim 17, wherein the polymerase enzyme molecule is coupled to each electrode by a single bridge molecule attached to and connecting the electrodes, wherein the polymerase enzyme is conjugated to the bridge molecule.
19. The method of claim 17, wherein the set of information is binary.
20. The method of claim 17, wherein the encoding scheme comprises any one or combination of binary encoding schemes BES1, BES2, BES3, BES4, BES5 and BES6.